Regaining Control of Your Codebase
Copyright © 2021 Maude Lemaire. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
See http://oreilly.com/catalog/errata.csp?isbn=9781492075530 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Refactoring at Scale, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-07553-0
[LSI]
While there are a number of books about refactoring, most of them deal with the nitty-gritty of improving small bits of code one line at a time. I believe that the most difficult part of a refactor is usually not finding the precise way to improve the code at hand, but rather everything else that needs to happen around it. In fact, I might also go so far as to say that for any large software project, the little things rarely matter; coordinating complex changes is the biggest challenge of all.
Refactoring at Scale is my attempt at helping you figure out those difficult pieces. It’s the culmination of many years of experience carrying out all sorts of refactoring projects of various scales. During my time at Slack, many of the projects I’ve led have allowed the company to scale dramatically; our product has gone from being able to support customers with 25,000 employees to those with a whopping 500,000. The strategies we developed to refactor effectively needed to tolerate explosive organizational growth, with our engineering team growing nearly sixfold during the same period. Successfully planning and executing on a project that affects both a significant portion of your codebase and an increasing number of engineers is no small feat. I hope this book gives you the tools and resources you need to do just that.
If you work in a large, complex codebase alongside dozens (or more) of other engineers, this book is for you!
If you’re a junior engineer seeking ways to start building more senior skills by making a difference at your company, a large refactoring effort can be a great way to achieve that. These kinds of projects have broad, meaningful impact extending well beyond your individual team. (They’re also not so glamorous that a senior engineer might snap them up right away.) They’re a great opportunity for you to acquire new professional skills (and strengthen the ones you already have). This book will teach you how to navigate this kind of project smoothly from start to finish.
This book is also a valuable resource for highly technical senior engineers who can code themselves out of any problem, but are feeling frustrated that others aren’t understanding the value of their work. If you’re feeling isolated and are looking for ways to level-up others around you, this book can teach you the strategies you need to help others see important technical problems through your eyes.
For the technical managers seeking to help guide their team through a large-scale refactor, this book can help you understand how to better support your team every step of the way. There isn’t a substantial amount of technical content contained within these pages, so if you are involved with a large-scale refactor in just about any capacity (engineering manager, product manager, project manager), you can benefit from the ideas herein.
When I set out on my first large-scale refactor, I understood why the code needed to change and how it needed to change, but what puzzled me most was how to introduce those changes safely, gradually, and without stepping on everyone else’s toes. I was eager to have cross-functional impact and didn’t pause to acknowledge the ramifications the refactor might have on others’ work, nor how I might motivate them to help me complete it. I simply plowed through. (You can read about this refactor in Chapter 10!)
In the years that followed, I refactored many, many more lines of code and ended up on the receiving end of a few ill-executed refactors. The lessons I’d learned from these experiences felt important, so I began speaking about them at a number of conferences. My talks resonated with hundreds of engineers, all of whom, like me, had experienced problems effectively refactoring large surface areas of code within their own companies. It seemed clear that there was some sort of gap in our software education, specifically around this core aspect of what it means to write software professionally.
In many ways, this book attempts to teach the important things that aren’t covered in a typical computer science curriculum, simply because they are too difficult to teach in a classroom. Perhaps they cannot be taught in a book either, but why not give it a try?
This book is split into four parts and organized in rough chronological order of the work required to plan and execute a large-scale refactor, outlined as follows.
Part I introduces important concepts behind refactoring.
Part II covers everything you need to know about planning a successful refactor.
Chapter 3 provides an overview of the many metrics you can use to measure the problems your refactor seeks to solve before any improvements are made.
Chapter 4 explains the important components of a comprehensive execution plan and how to go about drafting one.
Chapter 5 discusses different approaches to get engineering leadership to support your refactor.
Chapter 6 describes how to identify which engineers are best suited to work on the refactor and tips for recruiting them.
Part III focuses on what you can do to make sure that your refactor goes well while it is underway.
Chapter 7 explores how best to promote good communication within your team and with any external stakeholders.
Chapter 8 looks at a number of ways to maintain momentum throughout the refactor.
Chapter 9 provides a few suggestions for how to ensure that the changes introduced by your refactor stick around.
Part IV contains two case studies, both pulled from projects I was involved with while working at Slack. These refactors affected a significant portion of our core application, truly at scale. I hope these will help illustrate the concepts discussed in Parts I–III of the book.
This ordering is not prescriptive; just because we’ve reached a new phase doesn’t mean we shouldn’t revisit our previous assumptions if necessary. For example, you might be kicking off your refactor with a strong sense of the team you’ll be working with, only to discover halfway through drafting your execution plan that you’ll need to bring in more engineers than you had initially anticipated. That’s OK; it happens all the time!
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
This element signifies a tip or suggestion.
This element signifies a general note.
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/qcmaude/refactoring-at-scale.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Refactoring at Scale by Maude Lemaire (O’Reilly). Copyright 2021 Maude Lemaire, 978-1-492-07553-0.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/refactoring-at-scale.
Email bookquestions@oreilly.com to comment or ask technical questions about this book.
For news and information about our books and courses, visit http://oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia
Writing a book is not an easy task, and this one was no exception. Refactoring at Scale would not have been possible without the contributions of many people.
First, I’d like to thank my editor at O’Reilly, Jeff Bleiel. Jeff turned an inexperienced writer (me) into a published author. His feedback was always spot-on, helping me organize my thoughts more cohesively, and encouraging me to cut whenever I was being too wordy (which is something that happened quite frequently). I simply can’t imagine working with a better editor.
Second, I want to thank the handful of friends and colleagues who read early versions of a few chapters: Morgan Jones, Ryan Greenberg, and Jason Liszka. Their feedback assured me that my ideas were sound and would be valuable to a wide range of readers. For the words of encouragement and thought-provoking conversations, thanks go to Joann, Kevin, Chase, and Ben.
I’d like to thank Maggie Zhou for all her help cowriting the second case study chapter (Chapter 11). She is one of the most thoughtful, intelligent, energetic coworkers I’ve ever had the pleasure to work with and I’m thrilled for the world to read about our adventures together!
A huge thank you to my technical reviewers, David Cottrell and Henry Robinson. David has been a close friend since university and has led a number of large-scale refactors in his many years at Google. He’s since founded his own company. Henry is a colleague at Slack who’s made countless open-source contributions and seen explosive growth at Silicon Valley companies firsthand. They are both incredibly conscientious engineers, and the book greatly benefited from their guidance and wisdom. I am endlessly grateful for the many hours they spent verifying its contents. Any inaccuracies in the final manuscript are mistakes of my own.
Thank you to everyone who’s ever refactored something with me. There are too many of you to name, but you know who you are. You all have had a hand in shaping the ideas in this book.
Thank you to my family (Simon, Marie-Josée, François-Rémi, Sophie, Sylvia, Gerry, Stephanie, and Celia) for cheering me on from the sidelines.
Finally, thank you to my husband, Avery. Thank you for your patience, for giving me the time, space, and encouragement to write. Thank you for letting me hijack countless afternoons to talk through an idea or two (or three or four). Thank you for believing in me. This book is just as much yours as it is mine. I love you.
Someone once asked me what it was that I liked so much about refactoring. What kept me coming back to these types of projects at work so often? I told her that there was something addicting about it. Maybe it’s the simple act of tidying, like neatly cataloging and ordering your spices; or maybe it’s the joy of decluttering and finally deprecating something, like bringing a bag of forgotten clothes to Goodwill; or maybe yet it’s the little voice in my head reminding me that these tiny, incremental changes will amount to a significant improvement in my colleagues’ daily lives. I think it’s the combination of it all.
There’s something in the act of refactoring that can appeal to us all, whether we’re building new product features or working on scaling an infrastructure. We all must strike a balance in our work between writing more or writing less code. We must strive to understand the downstream effects of our changes, whether intentional or not. Code is a living, breathing thing. When I think about the code that I’ve written living on for another five, ten years, I can’t help but wince a little bit. I certainly hope that by that time, someone will have come along and either removed it entirely or replaced it with something cleaner and, most importantly, more suited to the needs of the application at that time. This is what refactoring is all about.
In this chapter, we’ll start by defining a few concepts. We’ll propose a basic definition for refactoring in the general case and build on top of it to develop a separate definition for refactoring at scale. To frame some of the motivations of this book, we’ll discuss why we should care about refactoring and what advantages we can bring to our teams if we’ve honed this skill. Next, we’ll dive into some of the benefits to expect from refactoring and some of the risks we should keep in mind when considering whether to do it. With our knowledge of the trade-offs, we’ll consider some scenarios when the time is right and when the time is wrong. Finally, we’ll walk through a short example to bring these concepts to life.
Very simply put, refactoring is the process by which we restructure existing code (the factoring) without changing its external behavior. Now if you think that this definition is incredibly generic, don’t worry; it purposefully is! Refactoring can take many equally effective forms, depending on the code it’s applied to. To illustrate this, we’ll define a “system” as any defined set of code that produces a set of outputs from a set of inputs.
Say we have a concrete implementation of such a system called S, pictured in Figure 1-1. The system was built under a tight deadline, encouraging the authors to cut some corners. Over time, it’s become a large pile of tangled code. Thankfully, consumers of the system aren’t exposed to its internal mess directly; they interact with S through a defined interface and rely on it to provide consistent results.
A few brave developers cleaned up the internals of the system, which we’ll now call S’, pictured in Figure 1-2. While it might be a tidier system, to the consumers of S’, absolutely nothing has changed.
System S could be anything; it could be a single if statement, a ten-line function, a popular open source library, a multimillion-line application, or anything in between. (Inputs and outputs could be equally diverse.) The system could operate on database entries, collections of files, or data streams. Outputs aren’t limited to returned values, but could also include a number of side effects such as printing to the console or issuing a network request. You can see how a RESTful service responsible for operating on user entities might map to our definition of a system in Figure 1-3.
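To make the S and S’ idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen only for illustration: the function names, the item tuple layout, and the bulk-discount rule are assumptions, not code from the book. Prices are integer cents so that both versions are exactly comparable.

```python
from typing import Iterable, Tuple

# Hypothetical item shape: (name, quantity, unit_price_in_cents).
Item = Tuple[str, int, int]


def order_total_s(items: Iterable[Item]) -> int:
    """System S: tangled but working. Returns the order total in cents."""
    t = 0
    for i in items:
        if i[1] > 0:
            if i[1] >= 10:
                # 10% bulk discount, written inline with magic numbers
                t = t + i[1] * i[2] - (i[1] * i[2]) // 10
            else:
                t = t + i[1] * i[2]
    return t


BULK_THRESHOLD = 10        # assumed quantity at which the discount applies
BULK_DISCOUNT_DIVISOR = 10  # subtotal // 10 is a 10% discount


def line_total(quantity: int, unit_price: int) -> int:
    """Total for one line item in cents, after any bulk discount."""
    subtotal = quantity * unit_price
    if quantity >= BULK_THRESHOLD:
        subtotal -= subtotal // BULK_DISCOUNT_DIVISOR
    return subtotal


def order_total_s_prime(items: Iterable[Item]) -> int:
    """System S': same inputs, same outputs, tidier internals."""
    return sum(line_total(qty, price) for _, qty, price in items if qty > 0)
```

A consumer calling either function with the same items gets the same total; only the internals differ. Checking the two against each other on a range of inputs is one simple way to gain confidence that the external behavior has not changed.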
As we continue to build on our definition of refactoring and begin exploring different aspects of the process, the best way to ensure we’re all on the same page is to connect each idea to a single, concrete example.
Using real-world programming examples is difficult for a few reasons. Given the breadth of experiences in our industry, choosing just one example over another immediately gives one group of readers a leg up. On the flip side, those deeply familiar with the example might be frustrated when some concepts are simplified for brevity or when certain nuances are ignored to apply a concept more cleanly. In hopes of establishing a level playing field, whenever we seek to illustrate a generic problem at a high level, we’ll use as our example a business familiar to (hopefully) most of us: a dry cleaning establishment.
Simon’s Dry Cleaners is a local dry cleaning business with a single location on a busy street in Springfield. It’s open Monday through Saturday during regular business hours. Customers drop off both regular laundry and dry-clean-only items. Depending on the quantity, urgency, and difficulty of each item, the items are cleaned and returned to the customers any time between two and six business days later.
How does this map to our definition of a system? The dry cleaning operation housed within the business is the system itself. It processes customers’ dirty clothing as inputs and returns them cleaned to their owners as outputs. All of the intricacies of the dry cleaning operation are hidden from the consumer; all we need to do is drop off our clothes and hope the cleaners are able to do their job. The system itself is quite complex; depending on the type of input (leather jacket, pile of socks, silk skirt), it may respond by performing one or more operations to ensure the proper output (a clean garment). There is ample opportunity for something to go wrong between drop-off and pickup: a belt might get lost, a stain overlooked, a shirt accidentally returned to the wrong customer. However, if the employees proactively communicate with one another, the machines are in good condition, and the receipts are kept in order, the system will continue to operate smoothly and it’ll be easy to fulfill orders.
Let’s say Simon’s still ran its operations using paper carbon-copy receipts. All customers coming in to drop off their clothes would write their name and phone number on the provided slip, and the clerk would take note of their order. If customers misplaced their receipts, Simon’s could easily locate the copy by leafing through their recent orders alphabetized by last name. Unfortunately, when customers are late to pick up their dry cleaning and they’ve misplaced their receipts, the clerk has to fetch archived slips from boxes in the back office. Although almost all orders are successfully retrieved, it takes much more time for the customer to pick up their apparel and be on their way again. Paper receipts are also inconvenient when the owners calculate their earnings at the end of each month; they have to match up all transactions (both credit card and cash) manually with completed orders. Eager to modernize and refactor their process, the team decided to upgrade their systems to use a point-of-sale system and erase the pain points of paper. Ta da, refactoring complete! Customers continue to drop off their dry cleaning and retrieve it a few days later with minimal perceived change, but now everything behind the front desk runs much more smoothly.
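The receipts story can be sketched in code as well. The following is only an illustration under stated assumptions: `Order`, `OrderBook`, `PaperReceipts`, `PointOfSale`, and `pickup` are hypothetical names, and the one-order-per-last-name simplification is ours. The point is that the customer-facing call stays the same while the storage behind the front desk changes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Tuple


@dataclass(frozen=True)
class Order:
    last_name: str
    phone: str
    items: Tuple[str, ...]


class OrderBook(Protocol):
    """The interface the front desk relies on (hypothetical)."""

    def record(self, order: Order) -> None: ...

    def find(self, last_name: str) -> Optional[Order]: ...


class PaperReceipts:
    """Before: slips kept in a pile and leafed through one at a time."""

    def __init__(self) -> None:
        self._slips: List[Order] = []

    def record(self, order: Order) -> None:
        self._slips.append(order)

    def find(self, last_name: str) -> Optional[Order]:
        for slip in self._slips:  # linear search through every slip
            if slip.last_name == last_name:
                return slip
        return None


class PointOfSale:
    """After: orders indexed by last name for direct lookup."""

    def __init__(self) -> None:
        self._by_name: Dict[str, Order] = {}

    def record(self, order: Order) -> None:
        # Keep the earliest order per name, matching PaperReceipts.find.
        self._by_name.setdefault(order.last_name, order)

    def find(self, last_name: str) -> Optional[Order]:
        return self._by_name.get(last_name)


def pickup(book: OrderBook, last_name: str) -> str:
    """What the customer experiences; unaware of which system is in use."""
    order = book.find(last_name)
    if order is None:
        return f"No order found for {last_name}"
    return f"{last_name}: {', '.join(order.items)}"
```

Because `pickup` depends only on the `OrderBook` interface, swapping `PaperReceipts` for `PointOfSale` changes nothing the customer can observe.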
In late 2013, amidst a tumultuous launch, all major American news outlets declared Healthcare.gov a complete fiasco; the website was plagued with security concerns, hours-long outages, and a slew of serious bugs. Before launch, not only had the cost ballooned to nearly two billion dollars, the codebase had blown up to over five million lines of code. While much of the failure of Healthcare.gov was due to failed development practices caught up in bureaucratic federal government policies, when the Obama administration later announced that it was planning to invest heavily in improving the service, the undeniable difficulty involved with rearchitecting and refactoring overgrown software systems became mainstream news. In the subsequent months, the teams tasked with rewriting Healthcare.gov dove headfirst into a near-complete overhaul of the codebase, a refactor at scale.
A refactor at scale is one that affects a substantial surface area of your systems. It typically (but not exclusively) involves a large codebase (a million or more lines of code) powering applications with many users. As long as legacy systems exist, there will be a need for these kinds of refactors, ones where developers need to think critically about code structure at breadth and how it can be measurably improved. What makes refactoring multimillion-line codebases different from refactoring smaller, better-defined applications? While it might be easy for us to see concrete, iterative ways to improve small, well-defined systems (think individual functions or classes), it becomes nearly impossible to determine the effect a change might have when applied uniformly across a sprawling, complex system. Many tools exist to identify code smells or automatically detect improvements within subsections of code, but we are largely unable to automate human reasoning about how to restructure large applications in codebases that are growing at an increasingly rapid pace, particularly at high-growth companies.
Some may argue that you can make a measured improvement to this kind of system by continuously applying small, additive transformations. This method might begin to tilt the scales in a positive direction, but progress is likely to drop off significantly when most of the low-hanging fruit is gone and it becomes trickier to introduce these changes carefully (and gradually).
Refactoring at scale is about identifying a systemic problem in your codebase, conceiving of a better solution, and executing on that solution in a strategic, disciplined way. To identify systemic problems and their corresponding solutions, you’ll need a solid understanding of one or more broad sections of an application. You’ll also need high stamina to propagate the solution properly to the entire affected area.
Refactoring at scale also goes hand in hand with refactoring live systems. Many of us work on applications with frequent deployment cycles. At Slack, we ship new code to our users about a dozen times per day. We need to be mindful of how our refactoring efforts fit into these cycles, to minimize risk and disruption to our users. Understanding how to deploy strategically at various points during a refactoring effort can oftentimes make the difference between a quiet rollout and a complete service outage.
What might Simon’s Dry Cleaners look like when considering scale? Say deploying a point-of-sale system dramatically optimized the business, so much so, in fact, that it managed to open five new locations in neighboring towns in just two years! Now that it’s operating multiple locations and growing the scale of its business, it has a different set of problems. To keep costs low, only two of its six locations have dry cleaning equipment on-site. When customers drop off dry cleaning at one of the four locations that do not have dry cleaning equipment on-site, the apparel must be sent to the closest facility via the company van. The van stops at all four storefronts to pick up clothes, dropping them off in large bins on the loading docks of the two dry cleaning locations. Simon’s employees work hard to sort through the heaps of clothes, clean them, and return them to the correct storefront. Most days, however, it’s a harrowing process. Both dry cleaning locations process apparel from their own storefronts as well as from the four smaller ones. It’s not uncommon for clothes to get separated or tangled when dropped into the processing bins by the van drivers. More urgent orders often get lost in the heap, and cleaners have to dig through the entire shipment to identify them first.
How can Simon’s improve its operations most efficiently? Should it dedicate a specific dry cleaning center for each location so that each facility handles orders from a maximum of three storefronts? If so, should it consider rerouting the vans in a specific way? What if it did both? Would it be cost-efficient to open yet another dry cleaning facility if it enables the business to decrease turnaround time? How should it set up its loading docks so that fewer clothes get tangled? Could the drivers be taught to hang and categorize orders properly by urgency before driving off to make another round? Should the company limit pickups to right after lunchtime and shortly after closing to give the dry cleaning locations more time to organize the drop-offs? There are quite a few options to consider, many of which could be combined and executed in any number of orders or simultaneously. Imagine being faced with all of these possibilities and having to decide which lever to pull first. It’s positively paralyzing! It turns out that refactoring large applications feels the same way.
Refactoring might sound compelling in theory, but how do you know that reading the rest of this book won’t be a waste of time? I certainly hope that all readers can walk away from this book with a few new tools in their tool belt, but if there’s a single reason I can provide to keep you reading it’s this:
Confidence in your ability to refactor allows you to lean toward action and start building a system sooner, well before you’ve developed a strong understanding of all the moving pieces, pitfalls, and edge cases. If you know you’ll be able to identify opportunities to improve components effectively throughout the development process, and will continue to be able to do so as the system grows more complex, you won’t need to spend as much time architecting a program upfront. Once you’ve honed the skills required to manipulate code effortlessly, you’ll spend less time worrying about boxing yourself in with any single design decision. While programming, you’ll find yourself opting to write something simple that works given the current circumstances rather than stepping back and planning your next half-dozen moves. You’ll recognize that there is always a (sometimes tricky) path to a better solution.
Programming isn’t a game of chess. When given a board configuration and assuming optimal opponents, the best competitive players deftly play out dozens of complete matches within minutes. Unfortunately, in our line of work, we aren’t provided a fully enumerated set of possible moves and there is no predetermined end state. I don’t mean to imply that there is no value in sitting down and brainstorming a robust solution to a problem, given a reasonable set of requirements; however, I do want to caution you against spending any significant time ironing out the final 10 percent to 20 percent. If you’ve honed your ability to refactor, you’ll be able to evolve your solution to handle the final specifications just fine.
Refactoring can have some tangible benefits beyond the ability to start confidently problem-solving sooner. Though it might not be the correct tool for every problem, it can certainly have a lasting, positive impact on your application, engineering team, and broader organization. We discuss two major benefits: increased developer productivity and greater ease in identifying bugs. While some might contend that refactoring has many more benefits than these, I argue that they all boil down to these two themes.
One of the primary goals of refactoring is yielding code that is easier to understand. Simplifying a dense solution as you reason through it not only helps you gain a better grasp of what the code is doing, it also helps everyone who comes after you do the same. Code you can easily comprehend elevates absolutely everyone on your team, no matter their tenure or experience level.
If you are a tenured engineer on the team, you tend to be very familiar with some parts of the codebase but, as the codebase grows, more and more parts are unfamiliar to you, and your code is increasingly likely to develop dependencies on those parts. Imagine that you’re implementing a new feature and in weaving your solution through the system, you venture from code you know rather well to unfamiliar territory. If the area unknown to you is well maintained and regularly refactored to take into account evolving product requirements and bug fixes, you’ll be able to narrow down the ideal location for your change and intuit an effortless solution much more quickly. If the code has instead deteriorated over time by accruing patchy bug fixes and ballooning in length, you’ll spend exponentially more time wading through each line, trying first to understand what the code is doing and how it’s doing it before you’re able to spend any time reasoning through an acceptable solution. (It’s not uncommon to drag someone else into the tortured-code rabbit-hole, whether it’s another engineer working alongside you or one who’s intimately familiar with the code to answer your questions.)
Let’s flip the scenario. What if a colleague on another team who isn’t familiar with your team’s code had to take a stab at reading through it? Would they have an easy time understanding how it works? Are you more likely to expect questions and confused looks, or a request for code review?
What if you were a new engineer on the team? Perhaps this was you just recently, or maybe you recently onboarded someone to your team whose experiences you can draw from. A new engineer has absolutely no mental model of the codebase. Their ability to gain confidence with any area of the code is directly proportional to the code’s legibility. When the code is legible, not only will they be able to organically build up an accurate mental representation of the relationships between different units in your codebase, they’ll be able to reason out what the code is doing without needing to tag teammates for questions. (It’s worth noting that knowing when and how to ask questions of your colleagues is an incredibly important skill to hone. Learning to evaluate how much time is appropriate for you to build your own understanding before seeking help is difficult but critical to growing as a developer. Asking questions isn’t a bad thing, but if you’re the tenured engineer on the team and you’re feeling bombarded with them, maybe it’s time to write some documentation and refactor some code.)
We’re all prone to copying established patterns when developing something new. If the solutions we reference are clear and concise, we’re more likely to propagate clear and concise code. The converse is also true: if the only solutions we have as reference are cluttered, we’ll propagate cluttered code. Ensuring that the best patterns are the most prevalent ones is particularly crucial in establishing a positive feedback loop with developers who are just starting out. If the code that they interact with on a regular basis is easy to understand, they’ll emulate a similar focus in their own solutions.
Tracking down and solving bugs is a necessary (and fun!) part of our jobs. Refactoring can be an effective tool in accomplishing both of these tasks! By breaking up complex statements into smaller, bite-sized pieces, and extracting logic into new functions, you can both build up a better understanding of what the code is doing and, hopefully, isolate the bug. Refactoring as you are actively writing code can also make it easier to spot bugs early in the development process, allowing you to avoid them altogether.
Consider the scenario in which your team deployed some new code to production a few hours ago. A few of the changes were embedded in a handful of files that everyone fears modifying: the code is impossible to read and contains a minefield of bugs waiting to happen. Unfortunately, your tests didn’t cover one of many edge cases and someone from customer service reaches out about a pesky bug users are starting to run into. You and your team immediately start digging in and quickly realize that the bug is, as expected, in the scariest part of the code. Thankfully, your teammate’s able to reproduce the problem consistently and, together, you write a test to assert the correct behavior. Now you have to narrow down the bug. You take methodical steps to break down the hairy code: you convert lengthy one-liners into succinct, multiline statements and migrate the contents of a few conditional code blocks into individual functions. Eventually, you locate the bug. Now that the code’s been simplified, you’re able to fix it swiftly, run the test to verify that it works, and ship a fix to your customers. Victory!
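To make the "break down the hairy code" step concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the `shipping_cost` functions, the fee constants, and the bug itself. The dense one-liner hides an operator-precedence slip; splitting it into named steps and small functions makes the slip easy to see and to fix.

```python
# Hypothetical pricing rules, assumed purely for illustration.
BASE_FEE = 5.0
PER_KG_FEE = 2.0
EXPRESS_FEE = 10.0
MEMBER_DISCOUNT = 0.9


def shipping_cost_dense(weight_kg, is_member, express):
    # Original one-liner. The conditional expression binds more loosely than
    # the multiplication, so non-members are charged a flat 1.0 instead of
    # the full, undiscounted price.
    return (5.0 + 2.0 * weight_kg + (10.0 if express else 0.0)) * 0.9 if is_member else 1.0


def express_surcharge(express):
    # Extracted from the inline conditional in the one-liner.
    return EXPRESS_FEE if express else 0.0


def member_multiplier(is_member):
    # Extracted so the discount applies as a multiplier, as intended.
    return MEMBER_DISCOUNT if is_member else 1.0


def shipping_cost(weight_kg, is_member, express):
    # Same calculation, broken into named, bite-sized steps.
    subtotal = BASE_FEE + PER_KG_FEE * weight_kg + express_surcharge(express)
    return subtotal * member_multiplier(is_member)
```

For members the two versions agree, which is why the bug could go unnoticed; a test asserting the non-member price is what pins down the corrected behavior.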
To the customer, sometimes bugs are only a minor nuisance, but other times, bugs can prevent the customer from using your application altogether. While more disruptive bugs generally require urgent remediation, it’s imperative that your team be able to solve bugs of all severity levels quickly to keep users happy. Working in a well-maintained codebase can dramatically decrease the time developers need to home in on and fix a bug, delighting you when the fix is shipped to production in record time.
While the benefits of refactoring might be compelling, there are some serious risks and pitfalls to consider before setting out on a journey to improve every inch (or centimeter) of your codebase. I may be starting to sound like a broken record, but I will reiterate it nonetheless: refactoring requires us to be able to ensure that behavior remains identical at every iteration. We can increase our confidence that nothing has changed by writing a suite of tests (unit, integration, end to end), and we should not seriously consider moving forward with any refactoring effort until we’ve established sufficient test coverage. However, even with thorough testing, there is always a small chance that something slips through the cracks. We also must keep in mind our ultimate goal: bettering the code in a way that is clear to both you and future developers interacting with the code.
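One way to gain that confidence is a test that pins down current behavior before any change is made. The following is a minimal Python sketch under stated assumptions: `legacy_label`, `refactored_label`, `find_mismatches`, and the sample orders are all hypothetical names and data invented for illustration. The check simply asserts that old and new implementations agree on every recorded input, quirks included.

```python
def legacy_label(order):
    # Hypothetical legacy code whose behavior, quirks included, must be preserved.
    label = ""
    if order["urgent"]:
        label = label + "URGENT "
    label = label + order["customer"].strip().upper()
    if len(order["items"]) > 1:
        label = label + " (" + str(len(order["items"])) + " items)"
    else:
        # Quirk: an empty order is also labeled "(1 item)".
        label = label + " (1 item)"
    return label


def refactored_label(order):
    # Candidate replacement: the same inputs must yield the same outputs.
    prefix = "URGENT " if order["urgent"] else ""
    customer = order["customer"].strip().upper()
    count = len(order["items"])
    suffix = f"({count} items)" if count > 1 else "(1 item)"
    return f"{prefix}{customer} {suffix}"


def find_mismatches(old, new, cases):
    # Return every input for which the two implementations disagree.
    return [case for case in cases if old(case) != new(case)]


# Sample inputs, including the edge case of an order with no items.
CASES = [
    {"urgent": True, "customer": " ada ", "items": ["shirt", "coat"]},
    {"urgent": False, "customer": "Grace", "items": ["suit"]},
    {"urgent": False, "customer": "linus", "items": []},
]
```

Running `find_mismatches(legacy_label, refactored_label, CASES)` after each small step gives a quick signal that behavior is unchanged, though only for the inputs you thought to record.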
Refactoring untested code is very dangerous and highly discouraged. Even development teams equipped with the most thorough, sophisticated testing suites still ship bugs to production. Why? With every change, large or small, we disrupt the equilibrium of the system in a measurable way. We strive to cause as little disruption as possible, but whenever we alter our systems, there is a risk that the change might lead to an unanticipated regression. As we refactor the exceptionally frightening, puzzling corners of our codebase, introducing a serious regression is of particular concern. These areas of the codebase are frequently in their current state because they’ve had plenty of time to deteriorate. At fast-growing companies, they are also frequently both integral to how your application works and the least tested. Attempting to detangle these files or functions can feel like trying to walk across a minefield unscathed: it’s possible, but very dangerous.
Just as refactoring can help you identify bugs, it can unintentionally unearth dormant bugs. Here, I classify dormant bugs as regressions that are most commonly exposed by restructuring code. We’ll revisit Simon’s Dry Cleaners to illustrate. The business has started ordering cleaning products in bigger batches at the same delivery cadence to unlock a better deal from the supplier. Unfortunately, there’s not much room to store the products in the back of the main storefront, so Simon’s decides to start stacking boxes closer to the loading dock door. After a few weeks of rain, the team notices that some of the boxes closest to the door are wet and falling apart. The owner notices that the back door is poorly sealed and allows water to seep through on wet days. Simon’s had never encountered a problem with storing supplies close to the loading dock door because they’d simply never done it before; exercising a new storage pattern exposed a critical flaw in their infrastructure, which they might have never discovered otherwise.
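The same thing can happen in code. Below is a small Python sketch of a dormant bug; the function names and the turnaround-time scenario are hypothetical, chosen only to echo the dry cleaning example. The helper has always divided by zero on empty input, but the original caller never exercised that path. A seemingly harmless reordering does.

```python
def average_turnaround(hours):
    # Dormant bug: raises ZeroDivisionError when `hours` is empty.
    return sum(hours) / len(hours)


def report_before(hours):
    # The original caller happens to guard against empty input first,
    # so the flaw in average_turnaround never surfaces.
    if not hours:
        return "no orders"
    return f"avg {average_turnaround(hours):.1f}h"


def report_after(hours):
    # A "tidier" restructuring that computes the average up front.
    # The guard now comes too late, and the dormant bug is exposed.
    avg = average_turnaround(hours)
    if not hours:
        return "no orders"
    return f"avg {avg:.1f}h"
```

Both versions agree on non-empty input, so a test suite without an empty-list case would not catch the difference.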
Refactoring can be a little bit like eating brownies: the first few bites are delicious, making it easy to get carried away and accidentally eat an entire dozen. When you’ve taken your last bite, a bit of regret and perhaps a twinge of nausea kick in. Experiencing immediate, highly significant improvements when you’re making focused, localized changes is incredibly rewarding! It’s easy to get carried away and allow the surface area of your changes to exceed reasonable bounds. What do I mean by reasonable bounds? Depending on the codebase, this can refer to a single functional area or a small, interdependent set of libraries. Ideally, the refactored code is limited to a set of changes another developer can comfortably review within a single changeset.
When mapping out a larger refactoring effort, especially one that might take several months or more, it’s absolutely imperative to keep a tight scope. We all run into unexpected quirks when refactoring small surface areas (a few lines of code, single functions); while we can sustainably chain a few enhancements to handle these new quirks effectively, this approach becomes dangerous when tackling a significant surface area. The larger the surface area of the planned refactor, the more problems you’ll encounter that you likely haven’t anticipated. That doesn’t make you a bad programmer, it simply makes you human. By keeping to a well-defined plan, you decrease the chances of causing a serious regression or running into dormant bugs, and promote productivity. Sustained, methodical refactoring efforts are already difficult; having a moving goalpost simply makes them unachievable.
Be wary of overdesigning at the start and be open to modifying your initial plan. The primary goal should be to produce human-friendly code, even at the cost of your original design. If the laser focus is on the solution rather than the process, there’s a greater chance your application will end up more contrived and complicated than it was in the first place. Refactoring at all levels should be iterative. By taking small, deliberate steps in one direction and maintaining existing behavior at each iteration, you’re better able to maintain focus on your ultimate goal. This is much easier to do when tackling only as much code as fits on your screen rather than three dozen libraries at a time. When we plan a new project, most of us generally try our best to develop a detailed specification document and execution plan. Even with a large refactoring effort, it’s important to have a good sense of what the resulting code should look like upon completion.
So when should we refactor? It would be easy simply to say “when the benefits outweigh the risks,” but that wouldn’t be a helpful answer. Yes, in practice, refactoring is a worthwhile effort when the benefits outweigh the risks, but how do we properly assign weight to each piece of the puzzle? How do we know when we’ve reached the tipping point and should consider a refactor?
In my experience, the tipping point is more of a tipping range, and it is different for everyone and every application. Determining your upper and lower bounds for this range is what makes refactoring a bit more of a subjective science: there is no formula we can use to give us a decisive “yes” or “no” answer. Fortunately, we can rely on some empirical evidence from others’ experiences to guide us in making our own decisions.
When looking to refactor a small, straightforward section of well-tested code, there should be very little holding you back. Unless you’re uncertain that your refactored solution is an objective improvement over its predecessor, or you’re fearful the change affects too large a surface area, it’s likely a worthwhile endeavor. Carefully craft a few commits and get your changes rolling! We’ll see an example that clearly falls into this category later in this chapter.
There are times when we have to venture into parts of our codebase we fear. Every time we read over the code, our brows furrow, our hearts pound, our neurons start firing. Then comes the moment when we have to bite the bullet, dig in, and make the change we came to make. But developing under duress is a surefire way to inadvertently cause more problems. When you’re so hyper-focused on doing precisely the correct thing, holding the many dimensions of the problem in your head, you risk losing sight of your actual goal. How can you execute adequately on that goal when your mind is elsewhere?
If this particular section of the code hasn’t bitten us yet, we’ll often take our chances and make the change as is. If it’s bitten us or a fellow teammate already (sometimes more than once), the risk of letting it linger in its current state any longer might outweigh the risk involved in taking a scalpel to the code now to prevent future mistakes. If you’re unsure which way the scales tilt, talk it over with your teammates and collect some data on the number of bugs caught in the past six months that you can trace back to this part of the codebase.
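Collecting that data can be as simple as tallying bug reports by the area of code they were traced to. Here is a minimal Python sketch; the `bugs_by_area` helper, the record shape (`area`, `fixed_on`), and the sample reports are all assumptions made for illustration, and in practice the records would come from your own issue tracker or version control history.

```python
from collections import Counter
from datetime import date, timedelta


def bugs_by_area(bug_reports, today, days=182):
    # Count bugs fixed within the last `days` days (roughly six months),
    # grouped by the area of the codebase they were traced back to.
    cutoff = today - timedelta(days=days)
    counts = Counter()
    for report in bug_reports:
        if report["fixed_on"] >= cutoff:
            counts[report["area"]] += 1
    return counts


# Hypothetical sample data standing in for real bug reports.
REPORTS = [
    {"area": "billing/invoice.py", "fixed_on": date(2021, 5, 3)},
    {"area": "billing/invoice.py", "fixed_on": date(2021, 2, 14)},
    {"area": "billing/invoice.py", "fixed_on": date(2020, 11, 1)},
    {"area": "search/index.py", "fixed_on": date(2021, 4, 20)},
]

counts = bugs_by_area(REPORTS, today=date(2021, 6, 30))
```

Here `counts.most_common()` ranks the areas by recent bug count, which gives the team something more concrete than intuition to discuss.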
Drastic shifts in product requirements can frequently map to drastic shifts in code. As hard as we might try to write abstract, extendable solutions for each piece of functionality in our application, we can’t predict the future; and while our code might be easy to adapt for small deviations, it is seldom perfectly adaptable to larger ones. These shifts give us the rare business-related opportunity to go back to the drawing board and reconsider our design.
You may be thinking that these sorts of shifts can’t possibly preserve behavior. Given the same inputs, now we must provide different outputs! How is this an opportune time for refactoring? If your code in its current state doesn’t lend itself well to the new requirements, you must come up with a solution that continues to support today’s functionality and will seamlessly support tomorrow’s. You can make a case for refactoring your code first, and then (and only then!) implementing the new functionality atop it. This way, you continue to set a standard of high-quality code, cashing in on all the benefits of refactoring, all the while supporting business objectives. Again, it’s a win, win, win!
Improving performance can be a difficult task; you must first build a deep understanding of the existing behavior and then be able to identify which levers you might be able to use to tilt the scales in a positive direction. Beginning with a clean slate (or building one as a first step) will best enable you to do that. Properly isolating the levers you’ve identified so that they are easier to manipulate without risk of downstream effects is also key.
Not all developers believe that performance improvements are a valid reason to refactor; some assert that a system’s performance is innately part of its behavior and therefore altering it in some way alters the behavior. I disagree. If we continue to define refactoring in terms of our generic system, to which we provide a set of inputs and which produces an expected set of outputs, then reducing the time (or memory) required to generate these outputs is a valid form of refactoring.
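A small Python sketch illustrates the distinction. Both functions below are hypothetical examples written for this purpose: they return exactly the same output for the same input, in the same order, but the second replaces repeated list scans with set lookups. Under the input/output definition, swapping one for the other is a refactor.

```python
def find_duplicates_slow(order_ids):
    # Quadratic: rescans the preceding list slice for every element.
    duplicates = []
    for i, order_id in enumerate(order_ids):
        if order_id in order_ids[:i] and order_id not in duplicates:
            duplicates.append(order_id)
    return duplicates


def find_duplicates_fast(order_ids):
    # Linear on average: set membership checks replace the list scans.
    # Each duplicate is still reported once, in order of first repeat.
    seen = set()
    reported = set()
    duplicates = []
    for order_id in order_ids:
        if order_id in seen and order_id not in reported:
            duplicates.append(order_id)
            reported.add(order_id)
        seen.add(order_id)
    return duplicates
```

Note that the faster version assumes the IDs are hashable, a constraint the slower one did not have; that kind of subtle change in accepted inputs is worth checking before calling the swap behavior-preserving.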
Refactoring for this purpose is unique in one important way: it does not ensure more approachable code as an outcome. Sometimes we’ll be reading through a codebase and come across a lengthy comment block warning about the code below it. In my experience, most of these comment blocks caution the reader about one (or more) of three complications: strange application behavior, a temporary workaround, or a peculiar performance patch. Most performance improvements prefaced by these short stories are written cleverly and leverage a deep understanding of the codebase as a means of minimizing the surface area affected. These “improvements” are more susceptible to degradation over a shorter period and as such are not good examples of the sustainability that refactoring is meant to foster. The worthwhile performance improvements, the ones worthy of falling under the refactoring umbrella, are profound and far-reaching; they are examples of effective refactoring deployed at scale. We’ll cover these changes in greater depth in Part II.
In the world of software development, we’re regularly adopting new technologies. Whether it’s to keep up with the newest trends in our industry, boost our ability to scale to more users, or mature our product in a new way, we’re perpetually evaluating new open-source libraries, protocols, programming languages, service providers, and more. Making the decision to use something new is not something we do lightly; this is partly due to the cost of integration within our existing codebases. If we opt to replace an existing solution with a new one, we have to craft a deprecation plan by identifying all affected callsites and migrating them (sometimes one at a time). If we opt to adopt a new technology moving forward, we have to identify high-leverage candidates for early adoption, with a plan to expand usage to all relevant use cases.
I won’t enumerate each of the ways using a new technology can affect your system (there are many), but it’s clear from these two scenarios that each requires a careful audit of your current system. Fortunately, an audit can reveal prime opportunities for refactoring! I want to take the time to acknowledge that this is a somewhat controversial opinion. Because of the risks involved with adopting a new technology alone, other developers may discourage you from making any other changes. However, I strongly believe that the worst way to introduce something new into your system is to stick it right in alongside a huge, tangled mess. To give it the best chance to fulfill its purpose, I think it’s best to take the time to clean up the areas it’ll come in contact with first.
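For the first scenario, replacing an existing solution by migrating call sites one at a time, one common approach is to keep the old entry point as a thin shim that warns and delegates to the new one. The Python sketch below is an assumption-laden illustration, not a prescribed method: `send_receipt`, `send_receipt_v2`, and their parameters are hypothetical, while `warnings.warn` with `DeprecationWarning` is the standard library mechanism for flagging deprecated calls.

```python
import warnings


def send_receipt_v2(customer_email, total_cents):
    # Hypothetical new implementation: takes the total in integer cents.
    return f"to={customer_email} total={total_cents / 100:.2f}"


def send_receipt(customer_email, total_dollars):
    # Deprecated shim: existing call sites keep working while they are
    # migrated one by one to send_receipt_v2.
    warnings.warn(
        "send_receipt is deprecated; use send_receipt_v2",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the shim
    )
    return send_receipt_v2(customer_email, round(total_dollars * 100))
```

The warnings double as an audit trail: each one points at a call site that still needs to be migrated, and the shim can be deleted once none remain.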
We can easily apply this concept to Simon’s Dry Cleaners. Let’s say it just recently put in an order for some new state-of-the-art, eco-friendly dry cleaning machinery. In figuring out an installation plan, the owners realize that their existing floor plan has some serious inefficiencies. Employees have to walk all the way along the line of machines to pick up presorted garments from the racks nearly thirty feet away. If the owners reorient the machinery so that employees can walk just a few feet to reach the racks, they might shave a few minutes off every cycle. They make the decision to install the new machines in the revised configuration. Simon’s may have decreased its impact on the environment and increased the productivity of its employees. Win, win!
Refactoring can be an astonishingly useful tool to a developer. Many developers believe that time devoted to refactoring is always time well spent, but it isn’t so simple. There is a time and a place for refactoring, and the most mature developers understand the importance of knowing when to refactor and when not to refactor.
Close your eyes for a minute and imagine yourself sitting in front of your computer. You’re looking at a particularly gnarly function. It’s too long; it tries to do too many things. Its name has long since ceased to describe its responsibility meaningfully. You’re itching to fix it. You’d love to split it up into well-defined, succinct units complete with better variable names. It’d be fun. But is it the most important thing you could be doing right now? Perhaps your teammate’s been waiting for your code review for a few days, or you’ve been putting off writing some tests. If you’re digging into some crufty old code and shifting it around to keep yourself entertained, you might be doing yourself (and your teammates) a disservice.
Chances are, if you’re refactoring for fun, you’re not focusing on the impact that your change will have on the surrounding code, the overall system, and your coworkers. We have different motivations when we’re refactoring for fun: we’re more likely to use more far-fetched language features or try out a brand new pattern we’ve been wanting to give a whirl. There is a time and place for trying new things and stretching our programming muscles, but refactoring isn’t that time. Refactoring should be a deliberate process where the focus is strictly on providing the (ideally) smallest change for the biggest positive impact.
Picture this: you write some code, ship it to production, and start working on a new feature. You come back to your code a few months later to expand on the feature. Unfortunately, it looks nothing like what you originally wrote. A million questions are racing through your mind. What happened here?
You may have fallen prey to the drive-by refactorer. This is a coworker who is experienced enough to have developed some well-informed opinions about how to write code. They are someone whom other engineers consult with about design decisions. They also have an unfortunate tendency to rewrite others’ code as they encounter it. They think they’re doing everyone a favor by doing this.
You might be tempted to agree, but consider this: if this engineer modified code in an area of the codebase where they are not an active contributor, it’s likely they’ve decreased the productivity of those who are responsible for it. We are most productive when we are familiar with the code for which we are responsible. When we’re tasked with quickly resolving an issue, whether it is a serious incident in production or a small bug, we use our mental model of the code to narrow down the set of files, classes, or functions where the problem might exist. If we open up our editor and find that nothing is where we left it, we’re disoriented and unable to fix the issue as quickly. This is incredibly costly to our employers in engineering hours, customer service hours, and potentially lost business.
Refactoring code without telling its original author is a disservice in two distinct ways. First, the drive-by refactorer has actively eroded the author’s trust. As much as we try to divorce ourselves from our code, we always leave a tiny piece of personal pride and ownership in the code that we’ve written. I’d much prefer that someone be honest with me about the shortcomings of my solution and show me how to fix it than find out about the problems after they’ve already been addressed. This is particularly harmful when it comes to newer engineers. Imagine yourself just one year out of school; you come into work one day only to find that the code you’d taken weeks to cobble together has been rewritten in a few hours by a much more senior engineer whom you’ve never talked to. It doesn’t feel great.
Second, they may not be aware of the initial circumstances surrounding the code at the time it was written. This is particularly troublesome when dealing with code that the drive-by refactorer is not actively maintaining. Why is this important? Programming is all about trade-offs; we can write a faster solution by using a more memory-intensive data structure or reduce our memory footprint by approximating rather than making precise calculations. Likewise, every line of “bad” code attempted to solve a problem. By blindly refactoring it, you may fall prey to a bug or weakness the original authors were carefully trying to avoid.
Don’t be a drive-by refactorer; be a well-intentioned one. Rarely refactor code that you are not actively maintaining, and when you do, make sure you’re doing it with the input of those responsible for it.
Many refactoring gurus advocate for refactoring as a means to render code more readily extendable. While this can be a clear outcome of a good refactor, rewriting code for the sake of future malleability is likely unwise. Time spent refactoring without a clear understanding of the immediate, tangible wins might be a wasted effort; your changes might not pay off within a reasonably short period, nor, in the absolute worst case, within the lifetime of the code.
If you can make the changes your project needs to a block of code as it stands, you probably shouldn’t be refactoring it. Most companies have new features to develop and bug fixes to ship. Generally speaking, these are almost always of higher priority. Unless you have a concrete set of goals, and a compelling argument that the refactor will directly affect your company’s bottom line, your management chain will be unconvinced. But don’t be dismayed! We’ll help you build a business case for refactoring in the coming chapters.
The only thing worse than code in dire need of refactoring is code half-refactored. Code in limbo is confusing to developers interacting with it. When there is no clear point in time when the code will be fully refactored, it takes on semi-permanent disorder. It’s often difficult for the reader to discern the direction or implementation to follow when reading code mid-refactor, especially if the refactorer left no comments in their wake. You might even make an incorrect assumption about which code will be adopted long-term and implement a necessary change in a block that’s headed for deprecation. These kinds of mistakes pile up quickly, leading to faster, more serious erosion of the code you hoped to improve in the first place.
When setting out to refactor something, make sure you have enough time to see your plans through to completion. If not, try to scope down your changes so that you can still make some improvements but comfortably reach the finish line. No temporary benefits reaped from an incomplete refactor outweigh the confusion and frustration of future developers interacting with it.
Now that we’ve built a solid foundation with which to begin understanding the goals of refactoring and how, under the right circumstances, it can enable us to be better programmers, let’s bring it all to life with a small example. This example is much smaller in scope than the kinds of refactoring efforts we’ll be discussing in this book, but it helps illustrate some of the concepts on a smaller scale so that we can get familiar with them early.
Let’s pretend we’re working at a university where we develop and support a rudimentary program that teaching assistants (TAs) use to submit assignment grades. The TAs use the program to verify that assignment grades fall within a certain range specified by the professor. This range is configurable because professors structure assignments differently, so not all problem sets are graded on a 0 to 100 point scale. Take, for example, a problem set with 10 questions. Each question is worth a maximum of 6 points. If you answer all questions correctly, your final grade is 60 out of 60. If you don’t submit the assignment at all, you’ll get 0 points.
Professors use the same tool to ensure that the average score for a given assignment falls within an expected range. Given our previous example, say the professor would like the average for the problem set to be between 42 and 48 points (for a percentage score between 70% and 80%). They can provide this expected range to the program, which then processes the final grades and determines whether the average falls within those bounds.
The function responsible for this logic is called checkValid and is shown in Example 1-1.
    function checkValid(minimum, maximum, values, useAverage = false) {
      let result = false;
      let min = Math.min(...values);
      let max = Math.max(...values);
      if (useAverage) {
        min = max = values.reduce((acc, curr) => acc + curr, 0) / values.length;
      }
      if (minimum < 0 || maximum > 100) {
        result = false;
      } else if (!(minimum <= min) || !(maximum >= max)) {
        result = false;
      } else if (maximum >= max && minimum <= min) {
        result = true;
      }
      return result;
    }
Right off the bat, we can spot some problems. First, the function name doesn’t fully capture its responsibilities. We’re not entirely certain what to expect from a function with a generic name like checkValid (especially if there isn’t any documentation atop the function declaration). Second, it’s unclear what the inlined values (0, 100) represent. Given what we know about the function’s expected behavior, we can deduce that these numbers represent the absolute minimum- and maximum-allowed point values for any assignment. Within the context, the minimum value of 0 makes sense, but why assert an upper bound of 100? Third, the logic is difficult to follow; not only are there quite a few conditions to reason through, the inlined logic can be complex, making it difficult for us to reason through each case quickly. At a quick glance, it’s nearly impossible to know whether the function contains a bug. We could spend considerable time enumerating the many issues contained within these few short lines of code, but to keep things simple, we’ll stop here.
How could so few lines of code be so tough to understand? Code in active development is regularly modified to handle small, low-impact changes (bug fixes, new features, performance tweaks, etc.). Unfortunately, these modifications pile up, oftentimes resulting in lengthier, more convoluted code. From the code structure, we can identify two changes that probably occurred after the function was initially written:
The ability to perform range validation on the average of the provided set of values rather than on each individual value. I can infer that this functionality was introduced later for two reasons. First, useAverage is an optional Boolean argument with a default value of false, implying that there are existing call sites that do not expect a fourth argument. (Boolean arguments are a code smell; we’ll address that shortly.) Second, the code overwrites both min and max with the single, new average value for convenience. This indicates that the author was looking for the easiest way to handle this requirement while modifying the least amount of code.
Ensuring that no provided range falls below 0 or exceeds 100. It seems strange to disallow professors from creating assignments worth more than 100 points, but we can assume that this was intended behavior for now. Although it isn’t a conclusive clue, we can guess this behavior was introduced as an afterthought because of where the conditional that verifies the range’s absolute limits is placed: it runs only after the minimum, maximum, and average have already been calculated. Why would we not immediately verify that the provided minimum and maximum bounds are within the acceptable range? The author of the change likely spotted the series of conditionals and thought the easiest place to add a new condition would be at the top of that chain. We could confirm our hypothesis by looking through the version history and hopefully finding the original commit with a helpful commit message.
First, let’s simplify some of the if statement logic. We can easily do that by returning a result from the function early rather than evaluating every branch and returning a final value. We’ll also return early in the case that the provided minimum and maximum values fall outside the 0 to 100 range, as shown in Example 1-2.
    function checkValid(minimum, maximum, values, useAverage = false) {
      if (minimum < 0 || maximum > 100) return false;
      let min = Math.min(...values);
      let max = Math.max(...values);
      if (useAverage) {
        min = max = values.reduce((acc, curr) => acc + curr, 0) / values.length;
      }
      if (!(minimum <= min) || !(maximum >= max)) return false;
      if (maximum >= max && minimum <= min) return true;
      return false;
    }
Return early if the minimum or maximum is out of range.
Simplify the logic by returning early when we can.
Now we’re getting somewhere! Let’s see whether we can further simplify the logic by reasoning through all of the cases for which the function would return false: there’s the case where the calculated minimum is smaller than the provided minimum and the case where the calculated maximum is greater than the provided maximum. We can replace the current conditions by failing early and only returning a true result after verifying each of these simple failure cases instead. Example 1-3 illustrates each of these changes.
    function checkValid(minimum, maximum, values, useAverage = false) {
      if (minimum < 0 || maximum > 100) return false;
      let min = Math.min(...values);
      let max = Math.max(...values);
      if (useAverage) {
        min = max = values.reduce((acc, curr) => acc + curr, 0) / values.length;
      }
      if (min < minimum) return false;
      if (max > maximum) return false;
      return true;
    }
Our next step will be to extract the inlined numbers (or magic numbers) into variables with informative names. We’ll also rename values to grades, and minimum and maximum to minimumBound and maximumBound, for clarity. (Alternatively, we could define these as constants within the same scope as the function declaration, but we’ll keep things simple for now.) Example 1-4 demonstrates these clarifications.
    function checkValid(minimumBound, maximumBound, grades, useAverage = false) {
      // Valid assignments should never allow fewer than 0 points
      const absoluteMinimum = 0;
      // Valid assignments should never exceed 100 possible points
      const absoluteMaximum = 100;
      if (minimumBound < absoluteMinimum) return false;
      if (maximumBound > absoluteMaximum) return false;
      let min = Math.min(...grades);
      let max = Math.max(...grades);
      if (useAverage) {
        min = max = grades.reduce((acc, curr) => acc + curr, 0) / grades.length;
      }
      if (min < minimumBound) return false;
      if (max > maximumBound) return false;
      return true;
    }
Next, we can extract the average calculation into a separate function, as shown in Example 1-5.
    function checkValid(minimumBound, maximumBound, grades, useAverage = false) {
      // Valid assignments should never allow fewer than 0 points
      const absoluteMinimum = 0;
      // Valid assignments should never exceed 100 possible points
      const absoluteMaximum = 100;
      if (minimumBound < absoluteMinimum) return false;
      if (maximumBound > absoluteMaximum) return false;
      let min = Math.min(...grades);
      let max = Math.max(...grades);
      if (useAverage) {
        min = max = calculateAverage(grades);
      }
      if (min < minimumBound) return false;
      if (max > maximumBound) return false;
      return true;
    }

    function calculateAverage(grades) {
      return grades.reduce((acc, curr) => acc + curr, 0) / grades.length;
    }
As we iterate on our solution, it becomes more obvious that the logic to handle the average of the set of grades seems increasingly out of place. Next, we’ll continue to improve our function by creating two functions: one that verifies that the average of a set of grades fits within a set of bounds and another that verifies that all grades within a set occur within a minimum and a maximum value. We could reorganize the code into more focused functions at this point in a number of ways. There is no right or wrong answer so long as we’ve found a way to divorce the logic for the two distinct cases effectively. Example 1-6 shows one such way of further simplifying our checkValid function.
    function checkValid(minimumBound, maximumBound, grades, useAverage = false) {
      // Valid assignments should never allow fewer than 0 points
      const absoluteMinimum = 0;
      // Valid assignments should never exceed 100 possible points
      const absoluteMaximum = 100;
      if (minimumBound < absoluteMinimum) return false;
      if (maximumBound > absoluteMaximum) return false;
      if (useAverage) {
        return checkAverageInBounds(minimumBound, maximumBound, grades);
      }
      return checkAllGradesInBounds(minimumBound, maximumBound, grades);
    }

    function calculateAverage(grades) {
      return grades.reduce((acc, curr) => acc + curr, 0) / grades.length;
    }

    function checkAverageInBounds(minimumBound, maximumBound, grades) {
      const avg = calculateAverage(grades);
      if (avg < minimumBound) return false;
      if (avg > maximumBound) return false;
      return true;
    }

    function checkAllGradesInBounds(minimumBound, maximumBound, grades) {
      const min = Math.min(...grades);
      const max = Math.max(...grades);
      if (min < minimumBound) return false;
      if (max > maximumBound) return false;
      return true;
    }
Extract the logic that determines whether the average of the grades is within the minimum and maximum bounds into its own function.
Extract the logic that determines whether all the grades are within the minimum and maximum bounds into a separate function.
Ta-da! We’ve successfully refactored checkValid in six simple steps.
Our new version has some clear benefits. With just a glance, we can develop a solid sense of what the code aims to do. By simplifying our conditions, we’ve also made it the slightest bit more performant and reduced the bug-prone logic. All in all, the next developer is more likely to be able to extend this solution without too much trouble. This is just a sneak peek into the potentially positive impact strategic refactoring at a microscopic level can have on your application; now imagine the impact that it can have when applied at scale.
But before we can sit down at our keyboards and start diligently refactoring, we need to orient ourselves properly. We need to understand the history of the code we want to improve, and for that, we need to understand how code degrades.
Successfully running a marathon is an impressive feat. While I’ve personally never taken on the challenge, quite a few of my friends have. What may surprise you, however, is that the large majority of these friends were not avid runners before deciding to sign up for their first half or full marathon. By sticking to a regular, sustainable training schedule, they were able to build up the necessary endurance in just a few months.
Most of my friends were already in good physical shape, but if your goal is to run a marathon and most of your current physical activity involves getting up from the couch to grab a bag of chips from your pantry, you will have a much more difficult time. Not only will you first have to build up the cardiovascular and physical endurance of a regularly active person, you’ll have to adopt new habits around regular exercise and eating healthy food (even when all you want to do is settle into a comfy chair with a big, cheesy slice of pizza).
Small fluctuations in training can lead to serious setbacks. If you haven’t gotten enough sleep or you get caught off guard by a scorching-hot day, you will tire more quickly, compromising your ability to run your target distance. Even in peak marathon form, you have to be prepared for the unknowns on the day of the race. It might rain; your laces might break; you might be stuck in a tight crowd of runners. You learn to master the variables you can control but must be willing and ready to think on your feet.
Being a programmer is a little bit like being a marathon runner. Both take sustained effort. Both build atop preceding progress, commit by commit, mile by mile. Making an earnest effort to maintain healthy habits can make the difference between being able to get back into marathon-running shape or peak development pace in a matter of weeks and having to take months to do so. Maintaining a high level of vigilance over both your internal and external environments and adjusting accordingly is key to completing the race successfully. The same can be applied to development: a high level of vigilance over the state of the codebase and any external influences is key to minimizing setbacks and ultimately ensuring a smooth path to the finish line.
In this chapter, we’ll discuss why understanding how code degrades is key to a successful refactoring effort. We’ll look at code that is either stagnant or in active development and describe ways in which each of these states can experience code degradation, with a few examples pulled from both recent and early computer science history. Finally, we’ll discuss ways in which we can detect degradation early, and how we might prevent it altogether.
Code has degraded when its perceived utility has decreased. What this means is that the code, while once satisfactory, either no longer behaves as well as we would like or isn’t as easy to read or use from a development perspective. It’s for these precise reasons that degraded code is a great candidate for refactoring. That said, I firmly believe that you cannot set out to improve something until you have a solid grasp of its history.
Code isn’t written in a vacuum. What we might deem to be bad code today was likely good code when it was originally written. By taking the time to understand the circumstances under which the code was originally written, and how, over time, it might have gone from good to bad, we can build a better awareness of the core problem, get a sense of the pitfalls to avoid, and, thus, have a better shot at taking it from bad back to good.
Broadly speaking, there are two ways in which code can degrade. Either the requirements for what the code needs to do or how it needs to behave have changed, or your organization has been cutting corners in an attempt to achieve more in a short period. We’ll refer to these as “requirement shifts” and “tech debt,” respectively.
I believe it’s important not to assume that all code degradation you run into is due to tech debt, which is why we’ll first take a look at the many ways requirement shifts can make code appear worse over time. We all have those moments when we’ll come across some particularly dreadful code and think, “Who wrote this? How could we let this happen? Why has no one fixed this?” If we begin to refactor it immediately, we risk crafting a solution that overemphasizes what we find most urgently frustrating about the code, rather than addressing its true, core pain points. It’s important to build empathy for the code by asking ourselves what has changed since it was written. If we make an effort to seek the initial good, we gain an appreciation for the pitfalls the original solution avoided and the clever ways it might have dealt with a set of constraints, and we can produce a refactored result that captures all these insights.
Unfortunately, there are times when we simply have to do our best, given very limited resources. When we don’t have enough time or money to create a better solution, we start cutting corners and accruing tech debt. While the initial impact of that debt might be minimal, its added weight on our codebases can build up significantly over time. It’s easy to dismiss tech debt as bad code, but I challenge you to reframe it. Sometimes the scrappiest solution is the one that gets your product or feature to market the fastest; if getting your product into the hands of users is critical to your company’s survival, then the tech debt might very well be worth it.
As you read through the ways in which code can degrade, I encourage you to try to find examples of each of these in the code you work with most regularly. You might not be able to find an example for everything, but the process of searching for the symptoms of code degradation might lead you to develop a new perspective on the pieces of your application you’ve found most frustrating to work with.
Once you’ve pinpointed code you’d like to refactor, you will gain valuable insight into the how and why of the original authors’ initial solution if you can sit down with them. Oftentimes, they’ll be able to tell you immediately why the code degraded. If the authors say something along the lines of, “we didn’t know that…,” or, “at the time, we thought…,” you likely have a case of code degradation due to requirement shifts. On the other hand, if the authors say something like, “oh, right, that code was never any good,” or, “we were just trying to meet a deadline,” you know that you’re probably dealing with a standard case of tech debt.
Whenever we write a new chunk of code, we ideally spend some time explicitly defining its purpose and providing thorough documentation to demonstrate intended usage. While we might try our best to anticipate any future requirements and attempt to design nimble systems able to handle these new demands, it’s unlikely we’ll be able to predict everything coming down the pipe. It’s only natural that the environments around our applications will change unpredictably over time. These changes can affect, to different degrees, both code that is in active development and code that has been left untouched. In this section, we’ll discuss a few ways in which the demands placed on our code might exceed its abilities, using examples from codebases under active and inactive development.
One requirement we frequently attempt to estimate is the direction and degree to which our product needs to scale. This laundry list of requirements can get rather lengthy and include a wide range of parameters. Take, for instance, a simple application programming interface (API) request to create a new user entry in a system. We might set some guidelines around the expected latency of the request, the number of database queries executed within the request, the total number of new user requests allowed per second, and so on.
When launching a new product, one of our first assumptions deals with how many users we expect to use it. We craft a solution we think will comfortably handle that number (give or take a safe margin of error) and ship it! If our product is successful, we can end up with exponentially more users than we initially anticipated, and while that’s certainly an amazing situation to be in from a business perspective, our original implementation probably won’t be able to handle this new, unanticipated load. The code itself may not have changed, but it has effectively regressed due to a drastic shift in scalability requirements.
Every application should strive to be as accessible as possible from day one. We should use color-blind-friendly color schemes, add alternative text for images and icons, and ensure that any interactive elements are accessible via the keyboard. Unfortunately, teams hastening to ship a new product or feature often gloss over accessibility in favor of a more aggressive launch date. While shipping new features might help you retain current users and attract new ones, if these features aren’t accessible to a subset of your anticipated user base, you risk alienating them. The second your product becomes inaccessible to some, its perceived utility substantially diminishes.
Although few iterations on official best practices for web accessibility have been developed by the Web Accessibility Initiative (WAI) since 1999, a number of important revisions have been standardized. With every new iteration, developers of active websites and applications must revisit code sometimes long untouched and implement any necessary changes to comply with the newest standards. Iterations on accessibility standards can decrease the quality of your application.
Every year, hardware companies release new versions of their devices; sometimes, they’ll even take things a step further and introduce an entirely new class of device. Among smartphones, smart watches, smart cars, and smart TVs, we are constantly playing catch-up, attempting to repackage our applications to work seamlessly on the latest hardware. Users have grown to expect that their favorite applications work on a variety of platforms. If you’re a developer for a popular mobile game and a major hardware company releases a new device with a higher screen resolution, you risk losing a significant portion of your user base unless you ship a new version of your game built to handle the larger screen.
When changes occur in a program’s environment, all sorts of unexpected behavior can begin to manifest. Before the age of modern gaming computers loaded with powerful graphics processing units (GPUs) and dozens of gigabytes of random-access memory (RAM), we had humble, little gaming consoles housed in arcades and, later, our living rooms. Game developers devised clever ways to use the limited hardware available to them to build classics like Space Invaders and Super Mario Bros. At the time, it was standard practice to use the central processing unit (CPU) clock speed as a timer in the game. It provided a steady, reliable measure of time. While this wasn’t a problem for console games, where the cartridges often weren’t compatible with newer, more powerful iterations of the console, it became a rather serious oversight for games running on personal computers. As clock speed on newer computers increased, so did the speed of gameplay. Imagine having to stack Tetris pieces or avoid a stream of Goombas at twice the normal speed; at a certain point, the game becomes wholly unusable. In both of these examples, the requirement was that the code was run on specific physical hardware; unfortunately, the hardware has since changed dramatically, and as a result, the code has degraded.
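To make the clock-speed problem concrete, here is a minimal sketch contrasting movement tied to the number of loop ticks with movement tied to elapsed time. The function names, parameters, and numbers are hypothetical and chosen only for illustration; they are not taken from any real game.

```javascript
// Hypothetical sketch: names and numbers are illustrative only.

// Clock-tied: the piece moves a fixed distance on every loop tick, so a
// machine that runs more ticks per second moves the piece faster.
function positionAfterTicks(ticksPerSecond, seconds, distancePerTick) {
  const ticks = ticksPerSecond * seconds;
  return ticks * distancePerTick;
}

// Time-based: each tick moves the piece by speed * elapsed time, so the
// distance covered depends only on how much time has passed.
function positionAfterTime(ticksPerSecond, seconds, distancePerSecond) {
  const deltaSeconds = 1 / ticksPerSecond;
  const ticks = ticksPerSecond * seconds;
  let position = 0;
  for (let i = 0; i < ticks; i++) {
    position += distancePerSecond * deltaSeconds;
  }
  return position;
}
```

With these definitions, doubling the tick rate from 60 to 120 doubles the result of positionAfterTicks over the same two seconds (120 versus 240 units at one unit per tick), while positionAfterTime lands at roughly the same position at either rate.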
These types of environmental changes are still a serious concern today. In January 2018, security researchers from Google Project Zero and Cyberus Technology, in collaboration with a team at the Graz University of Technology, identified two serious security vulnerabilities affecting all Intel x86 microprocessors, IBM POWER processors, and some Advanced RISC Machine (ARM)-based microprocessors. The first, Meltdown, allowed rogue processes to read all memory on a machine, even when unauthorized to do so. The second, Spectre, allowed attackers to exploit branch prediction (a performance feature of the affected processors) to reveal private data about other processes running on the machine. You can read more about these vulnerabilities and their inner workings on the official website.
At the time of the disclosure, all devices running any but the most recent versions of iOS, Linux, macOS, and Windows were affected. A number of servers and cloud services were affected, as well as the majority of smart devices and embedded devices. Within days, software workarounds became available for both vulnerabilities, but these came at a performance cost of 5 to 30 percent, depending on the workload. Intel later reported it was working to find ways to help protect against both Meltdown and Spectre in its next lineup of processors. Even the things we believe to be most stable (operating systems, firmware) are susceptible to changes in their own environments; and when these core, underlying systems on top of which we run countless applications are affected, we, in turn, are affected.
Every piece of software has external dependencies; to list just a few examples, these can be a set of libraries, a programming language, an interpreter, or an operating system. The degree to which these dependencies are coupled to the software can vary. This reliance isn’t anything new; many influential programs from the early days of artificial intelligence research were developed in Lisp and Lisp-like research programming languages as they were actively developed in the 1960s and early 1970s. SHRDLU, an early natural language–understanding computer program, was written in Micro Planner on a PDP-6, using nonstandard macros and software libraries that no longer exist today, thus suffering from irreparable software rot.
Today, we do our best to update our external dependencies to keep up to date with the latest features and security patches. Sometimes, however, we either deprioritize or lose track of updates, especially when it comes to code we’re not actively maintaining. While allowing dependencies to fall a few versions behind might not be an immediate problem, it does come at a risk. We become more susceptible to security vulnerabilities. We also open ourselves up to potentially difficult upgrade experiences at a later date.
Say we are running a program that relies on version 1.8 of an open-source library called Super Timezone Library. Just a few weeks after releasing version 4.0, the developers of Super Timezone Library announce that they will no longer actively support any versions below 3.0. We now need to upgrade to version 3.0 at the minimum to continue to receive security patches. Unfortunately, version 2.5 introduced some backward-incompatible changes and version 2.8 deprecated functionality used widely in our application. What could have been a small, regular investment in keeping the library up to date over the past few years has now turned into a much more complex, urgent investment.
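As a rough illustration of how we might catch this kind of drift early, the sketch below compares an installed version against a minimum supported version. The helpers are hypothetical and assume purely numeric, dot-separated version strings; real version schemes often include pre-release tags that this does not handle.

```javascript
// Hypothetical helper: parses "major.minor" or "major.minor.patch" strings.
function parseVersion(version) {
  return version.split('.').map((part) => Number.parseInt(part, 10));
}

// Returns a negative number if a < b, zero if equal, positive if a > b.
function compareVersions(a, b) {
  const left = parseVersion(a);
  const right = parseVersion(b);
  const length = Math.max(left.length, right.length);
  for (let i = 0; i < length; i++) {
    // Missing components are treated as 0, so "3" equals "3.0".
    const diff = (left[i] || 0) - (right[i] || 0);
    if (diff !== 0) {
      return diff;
    }
  }
  return 0;
}

// True when the installed version is at or above the supported minimum.
function isSupported(installed, minimumSupported) {
  return compareVersions(installed, minimumSupported) >= 0;
}
```

In the scenario above, isSupported('1.8', '3.0') is false. A check like this, run regularly, could flag the gap long before support is withdrawn.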
Changes in requirements can lead to unused code. Take, for example, a publicly facing API. Your team decides to deprecate the API and warn third-party developers of the upcoming change. Unfortunately, after you’ve communicated the intended change, removed the documentation from your website, and ensured that no external systems were still relying on the endpoint, your team forgets to remove the code. A few months later, a new engineer begins implementing a new feature, stumbles upon the decommissioned API endpoint, and assumes, quite naturally, that it is still functional. They decide to repurpose it for their own use case. Unfortunately, they quickly find out that the code doesn’t do quite what they intended, simply because the API had been left in the dust and hadn’t adapted with the rest of the codebase and numerous iterations of requirement changes.
Unused code can also be problematic from a developer productivity perspective. Every time we encounter code we believe to be unused, we have to determine very carefully whether we can safely remove it. Unless we’re equipped with reliable tooling to help us properly highlight the extent of the dead code, we might have a difficult time pinpointing its exact boundaries. If we aren’t sure whether we can delete it, usually we’ll just move on and hope someone else can figure it out later on. Who knows how many engineers will come across the same piece of code and ask themselves the same question before it’s finally removed!
Finally, unused code, if allowed to pile up, can be a hindrance to performance. If, for example, your team works on the client-facing portion of a website, the size of the JavaScript files requested by your browser directly translates to initial page load times. Typically, the larger the file, the slower the response. Greedily requesting bloated application code can be quite detrimental to the user experience.
Most of the time, it’s easier to write a solution for today or tomorrow’s product requirements, solving for the problems and constraints we understand and can easily anticipate, than to write one for next year, attempting to solve for unknown future pitfalls. We try to be pragmatic, weighing current concerns against future concerns, and attempting to determine how much time we should invest in solving for either. Sometimes, we simply don’t have a good intuition about the future.
Boolean arguments to functions are a great example of the difficulty of predicting future product requirements in action. Most of the time, Boolean arguments are introduced to existing functions to modify their behavior. (We saw one in “Our First Refactoring Example”, where a Boolean flag was used to decide whether we wanted to know whether each of the grades or the average of those grades fell in a given range.) Adding a Boolean flag is often the smallest, simplest change you can make when you find a function that does almost exactly what you want it to do, with just a tiny exception. Unfortunately, this type of change can cause all sorts of problems down the line. We can see some of those in action in Example 2-1, where we have a small function responsible for uploading an image given a filename and a flag denoting whether the file is a PNG.
    function uploadImage(filename, isPNG) {
      // some implementation details
      if (isPNG) {
        // do some PNG-specific logic
      }
      // do some other things
    }
What if, a few months from now, we decide to support a new image format? We might decide to add another Boolean argument to designate isGIF, as shown in Example 2-2.
    function uploadImage(filename, isPNG, isGIF) {
      // some implementation details
      if (isPNG) {
        // do some PNG-specific logic
      } else if (isGIF) {
        // do some GIF-specific logic
      }
      // do some other things
    }
Introduced a new Boolean argument to designate whether the image is a GIF.
An image cannot be both a PNG and a GIF, so we’ve added an else if here.
To call this function and correctly upload a GIF, we would need to remember to set the second Boolean argument to true. Readers who come across the code calling out to uploadImage would likely be confused and need to refer to the function definition to understand what role the two Boolean arguments play.
In a language with named arguments, we would be less concerned with needing to reference the function definition to know the role and order of arguments. Regardless of language choice, it remains that while uploadImage(filename=filename, isPNG=true, isGIF=true) may seem nonsensical, it is a perfectly valid function call (and is very likely to cause bugs in the future). Example 2-3 shows an example where it might be difficult for the reader to discern what uploadImage does given the context.
    function changeProfilePicture(filename) {
      // some implementation details
      if (isAnimated) {
        uploadImage(filename, false, true);
      } else {
        uploadImage(filename, true, false);
      }
      // do some other things
    }
Not only is it difficult for developers to understand how uploadImage works when reading through functions like changeProfilePicture, it’s an unsustainable pattern to continue to maintain if more image formats are introduced in the future. The developer who added the first Boolean argument to support isPNG was mostly concerned with today’s problems rather than those of tomorrow. A better approach would be to split up the logic into distinct functions: uploadJPG, uploadPNG, and uploadGIF, as shown in Example 2-4.
    function uploadImagePreprocessing(filename) {
      // some implementation details
    }

    function uploadImagePostprocessing(filename) {
      // do some other things
    }

    function uploadJPG(filename) {
      uploadImagePreprocessing(filename);
      // do JPG things
      uploadImagePostprocessing(filename);
    }

    function uploadPNG(filename) {
      uploadImagePreprocessing(filename);
      // do PNG things
      uploadImagePostprocessing(filename);
    }

    function uploadGIF(filename) {
      uploadImagePreprocessing(filename);
      // do GIF things
      uploadImagePostprocessing(filename);
    }
Now you might be wondering why adding the isPNG Boolean argument is a serious problem if we can just refactor it later. To replace all occurrences of uploadImage properly, we’d need to audit each callsite individually and replace it with either uploadJPG or uploadPNG, depending on whether the Boolean argument is set to true. Because these changes are manual but mundane, the likelihood of us making the wrong replacement is quite high and could lead to some serious regressions. Depending on how widespread the problem might be, and how tightly coupled it might be to other crucial business logic, refactoring what seems like a simple Boolean argument might be a daunting task.
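A minimal sketch of one way to make that audit less risky, assuming we can keep the old signature around temporarily: the old uploadImage becomes a thin wrapper that delegates to the format-specific functions, so callsites can be migrated one at a time. The stub bodies below are hypothetical stand-ins for the functions in Example 2-4, and rejecting the case where both flags are true is an assumption of this sketch (the version in Example 2-2 would silently take the PNG path).

```javascript
// Stand-ins for the format-specific functions from Example 2-4. Here they
// only report which path was taken, so the sketch is self-contained.
function uploadJPG(filename) {
  return `jpg:${filename}`;
}

function uploadPNG(filename) {
  return `png:${filename}`;
}

function uploadGIF(filename) {
  return `gif:${filename}`;
}

// Transitional wrapper that keeps the old Boolean signature while callsites
// are migrated to the format-specific functions.
function uploadImage(filename, isPNG, isGIF) {
  if (isPNG && isGIF) {
    // Assumption: fail loudly on the nonsensical combination.
    throw new Error('an image cannot be both a PNG and a GIF');
  }
  if (isPNG) {
    return uploadPNG(filename);
  }
  if (isGIF) {
    return uploadGIF(filename);
  }
  return uploadJPG(filename);
}
```

This does not remove the need to visit every callsite, but it lets old and new calls coexist while the migration is underway, and the wrapper can be deleted once no callers remain.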
The most common culprits behind tech debt are limited time, limited numbers of engineers, and limited money. Given that all technology companies are faced with limited resources on one or more axes, each and every one of them has tech debt. Tiny, six-month-old startups; giant, decades-old conglomerates; and every company in between all have their fair share of crufty code. In this section, we’ll take a closer look at how these influences can lead to the accumulation of tech debt. Although it can be easy to point a finger at the original authors of the code and admonish them for making decisions that appear suboptimal today, it’s important to remember that they were operating under serious constraints. We have to acknowledge that sometimes it’s just about impossible to write good code under tight pressure.
When implementing something new, we have to make some critical decisions about which technologies we want to use. We have to choose a language, a dependency manager, a database, and so on. There’s a fairly long laundry list of decisions to make well before the application becomes available to any users. Many of these decisions are made given the engineers’ experience; if these engineers are more comfortable using one technology over another, they’ll have an easier time getting the project up and running quickly than if they decided to adopt a new stack.
Once the project’s been launched and found some traction, these early technology decisions are put to the test. If a problem with a technology choice arises early enough in the lifetime of the application, it might be easy and inexpensive to find an appropriate alternative and pivot to it, but oftentimes the limitations of those choices don’t become apparent until well after the application has grown past this point.
One such decision might be to develop an application by using a dynamically typed programming language instead of a statically typed programming language. Proponents of dynamically typed programming languages argue that they make the code easier to read and understand; less indirection around strictly defined structures and type declarations allows the reader to understand the purpose of the code better and more readily. Many also tout the quicker development cycle they provide due to the lack of a compile step.
While there are many upsides to using dynamically typed programming languages, they become difficult to manage when applications grow beyond a critical mass. Because types are only verified at runtime, it is the developer’s responsibility to ensure type correctness by writing a full suite of unit tests that exercises all execution paths and asserts expected behavior. New developers seeking to familiarize themselves with how different structures interact with one another might have a difficult time doing so if variable names do not immediately indicate which types they hold. It’s not uncommon to end up needing to program defensively, as shown in Example 2-5, where we assert that a value passed into a function has certain properties and isn’t unintentionally null.
    function addUserToGroup(group, user) {
      if (!user) {
        throw 'user cannot be null';
      }

      // assert required fields
      if (!user.name) {
        throw 'name required';
      }
      if (!user.email) {
        throw 'email required';
      }
      if (!user.dateCreated) {
        throw 'date created required';
      }

      // assert no empty strings or other invalid values
      if (user.name === "") {
        throw 'name cannot be empty';
      }
      if (user.email === "") {
        throw 'email cannot be empty';
      }
      if (user.dateCreated === 0) {
        throw 'date created cannot be 0';
      }

      group.push(user);
      return group;
    }
It’s very likely the author of the code sample runs into issues regularly with invalid users weaving their way through a callstack at runtime simply due to the dynamic nature of JavaScript. The author just wants to be certain that they are only adding valid users to the group, and that’s completely understandable. Unfortunately, now addUserToGroup is primarily concerned with ensuring that the user provided is valid, rather than adding the user to the group. As more decisions are made about what constitutes a valid user, each of these ad hoc validations sprinkled throughout the codebase needs to be updated. There’s also an increasing chance we might introduce a bug by simply forgetting to update one such location. Eventually, we end up with lengthy, convoluted, bug-prone functions everywhere.
We can introduce a new function to help mitigate code degradation. Let’s say we write up a simple helper to encapsulate all the logic for validating a user object; we’ll call it validateUser. Example 2-6 shows its implementation.
    function validateUser(user) {
      if (!user) {
        throw 'user cannot be null';
      }

      // assert required fields
      if (!user.name) {
        throw 'name required';
      }
      if (!user.email) {
        throw 'email required';
      }
      if (!user.dateCreated) {
        throw 'date created required';
      }

      // assert no empty strings or other invalid values
      if (user.name === "") {
        throw 'name cannot be empty';
      }
      if (user.email === "") {
        throw 'email cannot be empty';
      }
      if (user.dateCreated === 0) {
        throw 'date created cannot be 0';
      }

      return;
    }
We can then update addUserToGroup to use our new helper function, drastically simplifying the logic, as shown in Example 2-7.
    // Simplified addUserToGroup function without inline validation logic
    function addUserToGroup(group, user) {
      validateUser(user);
      group.push(user);
      return group;
    }
Unfortunately, while it’s much easier for us to call validateUser, replacing all the locations where we previously enumerated each check won’t be an easy task. First, we have to identify each of those spots. If we’re dealing with a large codebase, that might be a daunting task. Second, in auditing each of these locations, we’ll probably end up finding a handful of instances where we’ve forgotten a check or two. In some cases, this is a bug, and we can safely replace the checks with a single call to validateUser; in other cases, this might have been intentional, and we cannot blindly replace the existing code with our new helper at the risk of introducing a regression. As such, easing the burden of our defensive programming requires us to plan and execute a sizable refactor.
Maintaining an organized codebase is a little bit like maintaining a tidy home. It seems as though there’s always something more important to do than to put away the clothes heaped over the dresser or sort through the stack of mail accumulating on the coffee table. But the more we accumulate, the more time we’ll spend combing through it all when we finally get around to it. You might even allow the clutter to build up to the point that it’s begun overflowing onto other surfaces. My parents were onto something when they encouraged me to keep things tidy and clean up just a little bit every day; they knew that it was always much easier to take care of a small mess than a massive one.
Many of us fall into the same patterns when it comes to keeping our codebases organized. Take, for instance, a codebase with a relatively flat file structure. Most of the code is organized into two dozen or so files, with a single directory for tests. The application grows at a steady pace, with a few new files added every month. Because it’s easier to maintain the status quo, engineers learn to navigate the increasingly sprawling code instead of proactively organizing related files into directories. New engineers introduced to the growing chaos raise a warning flag and encourage the team to begin splitting up the code, but these concerns fall on deaf ears; managers encourage them to focus on the deadlines looming ahead, and tenured engineers shrug and reassure them that they’ll quickly figure out how to be productive in the disarray. Eventually, the codebase reaches a critical mass in which the persistent lack of organization has dramatically slowed productivity across the engineering team. Only then does the team take the time to draft a plan for grooming the codebase, at which point the number of variables to consider is far greater than it would have been, had they made a concerted effort to tackle the problem months (or even years) earlier.
Rapid iteration and product development can swiftly degrade software quality if not kept in check. When building out new product features under aggressive deadlines, we tend to cut corners: we’ll omit a few test cases, give variables generic names, or add a few if statements where we could have made a new function. If we do not properly make note of the corners we’ve cut and allocate the time necessary to correct them immediately after we’ve met our target deadline, they pile up. Soon, you end up with exceedingly lengthy functions, littered with branching logic and backed by little-to-no unit test coverage, sprinkled throughout your codebase. When working in more complex applications, where multiple teams are iterating on distinct features alongside one another, the effects of moving too quickly begin to compound. Unless every team can communicate product changes effectively with every other team, the cruft piles up. You can see an example of that compounding effect illustrated in Figure 2-1.
Many of us working on modern applications practice continuous integration and delivery; we merge our changes back into the main branch as often as possible, where they’re validated by running automated tests against a new build of the application. We ensure that customers aren’t exposed to half-baked features and partial bug fixes by gating these changes behind feature flags (otherwise known as feature toggles). While these give us a good amount of flexibility during active development, they’re easy to forget about once we’ve successfully introduced the change to all the users.
Every company I’ve worked for had dozens (if not hundreds) of feature flags still being referenced in the program well after they’d been enabled for all of production. While it might seem benign to leave a few of these checks lying around, there are some distinct risks.
First, it causes added cognitive load on developers reading the code; if the developer doesn’t take the time to verify the status of the feature, they might be misled into thinking it is still under active development and only make an important change in the nongated codepath. Second, it can be frustrating to spend time determining whether the feature is active in production, only to find out that it’s been live to everyone for weeks. In the severe cases where there are hundreds of essentially defunct feature flags, this can have a very serious performance impact on the application. The cumulative time spent validating each feature-related conditional for a given request or codepath can be significant. We might all see some performance enhancements by cleaning up our obsolete flags.
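One way to surface these defunct flags is to compare each flag's rollout state against the places where it is still referenced. The sketch below is only illustrative: the flag list (with `name` and `rolloutPercent` fields), the set of referenced names, and the function `findStaleFlags` are all hypothetical and do not come from any particular feature-flag library.

```javascript
// Hypothetical shapes: `flags` would come from wherever rollout state is stored,
// and `referencedNames` from a search of the codebase for flag checks.
function findStaleFlags(flags, referencedNames) {
  return flags
    .filter((flag) => flag.rolloutPercent === 100) // fully rolled out
    .filter((flag) => referencedNames.has(flag.name)) // still checked in code
    .map((flag) => flag.name);
}

const flags = [
  { name: 'new_checkout', rolloutPercent: 100 },
  { name: 'dark_mode', rolloutPercent: 25 },
  { name: 'fast_search', rolloutPercent: 100 },
];
const referencedNames = new Set(['new_checkout', 'dark_mode']);

console.log(findStaleFlags(flags, referencedNames));
```

Here only `new_checkout` is both fully enabled and still referenced, so it is the one candidate for cleanup.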
Code degradation is inevitable. No matter how hard we try to avoid them, there will be shifts in requirements that our applications will need to adapt to. We can try to minimize development under pressure, but sometimes we need to cut corners to ship quickly and give our business a competitive advantage. If code degradation is inevitable, then refactoring at scale is equally inevitable. There will always be a need for us to address tricky, systemic problems in our codebases. If we believe the degradation has become so burdensome that it prevents our engineering team from developing as well as it could, then we need to put on our hard hats and figure out both why and how we got to this point.
When we learn to see beyond code’s immediate problems and instead seek to understand the circumstances under which it was originally written, we begin to see that code isn’t inherently bad. We build empathy and use this newfound perspective to identify the code’s true foundational problems and hatch a plan to improve it in the best way possible. Think of this process as just one big exercise in code archaeology!
Now that we’ve learned how code degrades, we have to learn how to quantify it properly for others to understand. We have to combine our hunch that the degradation is at a critical point with our knowledge of why and how it got there, and figure out the best way to distill the problem into a set of metrics we can use to convince others that this is, in fact, a serious problem. The next chapter discusses a number of techniques you can use to measure problems in your codebase and establish a solid baseline for your refactoring effort.
Every spring, I take the time to clean out my closet and reevaluate all of the clothing I own. While some opt for a Marie Kondo–like approach to cleaning out their closets, seeing whether each item “sparks joy,” I take a more methodical one. Each year, when I kick off the process, I know that by the end, a number of items will be in the donate pile. What I don’t know is which pieces these will be, because it entirely depends on how all of my clothing works together in the first place.
Before I start packing some bags for Goodwill, I take a comprehensive look at the whole. I organize everything by clothing type: sweaters in one pile, dresses in another, and so on, accounting for the practicality of each item of clothing as I go. Which seasons is this dress good for? How comfortable is it? How often have I worn it in the past year? Next, I approximate how many outfits the item can be integrated with. It’s only once I have a strong sense of everything I own, and understand the role each item of clothing plays in my closet, that I can start to identify the clothing I can comfortably donate.
The same logic applies to large refactoring efforts; only once we have a solid characterization of the surface area we want to improve can we begin to identify the best way to improve it. Unfortunately, finding meaningful ways of measuring the pain points in our code today is much more difficult than categorizing items of clothing in our closets. This chapter discusses a number of techniques for quantifying and qualifying the state of our code before we begin refactoring. We’ll cover a few well-known techniques as well as a few newer, more creative approaches. By the end of the chapter, I hope you’ll have found one (or more) ways to measure the code you want to improve in a way that highlights the problems you want to solve.
There are a number of ways to measure the health of a codebase. Many of these metrics, however, might not move in a positive direction as a result of a large-scale refactor simply because they are orthogonal to the pain points the project aims to address. So, in measuring the starting state of our codebase, we want to choose a metric that we believe will summarize the problem well and accurately highlight the impact of our refactor.
Measuring the impact of any refactoring effort is tricky, primarily because when executed successfully, refactoring should be invisible to users and lead to no behavioral changes whatsoever. This isn’t a new feature or tweak that we’re hoping will drive user adoption. We often put a great deal of effort into monitoring critical pieces of our applications to ensure that our users are getting a reliable experience when using our product, but because these metrics capture behavior that our users are likely to notice, most of them remain unaffected when we’ve refactored correctly. To characterize the impact of a refactor best, we need to identify metrics that measure the precise aspects of the code we want to improve and establish a strong baseline before moving forward.
Large refactoring efforts are particularly difficult to measure because they rarely take place in the span of just a few weeks. More often than not, the work involved from start to finish spans far beyond the typical feature development cycle, and unless product development was completely paused while the refactoring effort was ongoing, it might be difficult to isolate its impact from the work of other developers in the same section of the application. Reliance on a handful of distinct metrics can help you paint a more holistic picture of your progress and better distinguish your changes from those introduced by other developers iterating on the product alongside you.
Many of us are motivated to refactor as a means of boosting developer productivity, making it easier for us to continue to maintain our applications and build new features. In practice, this often means simplifying complex, convoluted sections of code. Given that our goal revolves around decreasing code complexity, we need to find a meaningful way of measuring it. Quantifying the code’s complexity gives us a starting point from which we can begin to assess our progress.
Measuring software complexity is easy in two main ways. First, if our code resides in version history, we can easily travel through time and apply our complexity calculations at any interval. Second, a vast number of open-source libraries and tools are readily available in many programming languages. Generating a report for your entire application can be as simple as installing a package and running a single command.
Here, we’ll discuss three common methods of calculating code complexity.
Maurice Halstead first proposed measuring the complexity of software in 1975 by counting the number of operators and operands in a given computer program. He believed that because programs mainly consisted of these two units, counting their unique instances might give us a meaningful measure of the size of the program and therefore indicate something about its complexity.
Operators are constructs that behave like functions, but differ syntactically or semantically from typical functions. These include arithmetic symbols like - and +, logical operators like &&, comparison operators like >, and assignment operators like =. Take, for instance, a simple function that adds two numbers together, as shown in Example 3-1.
    function add(x, y) {
      return x + y;
    }
It contains a single operator, the addition operator, +. Operands, on the other hand, are any entities we operate on, using our set of operators. In our addition example, our operands are x and y.
Given these simple data points, Halstead proposed a set of metrics to calculate a set of characteristics:
A program’s volume, or how much information the reader of the code has to absorb in order to understand its meaning.
A program’s difficulty, or the amount of mental effort required to write or understand the software. It is closely related to, but distinct from, the Halstead effort metric, which is the product of difficulty and volume.
The number of bugs you are likely to find in the system.
To illustrate Halstead’s ideas better, we can apply our operator and operand counting technique to a slightly more complicated function, which calculates an integer’s prime factors, as in Example 3-2. We’ve enumerated each of the unique operators and operands, along with the number of times they occur in the program, in Table 3-1.
    function primeFactors(number) {
      function isPrime(number) {
        for (let i = 2; i <= Math.sqrt(number); i++) {
          if (number % i === 0) return false;
        }
        return true;
      }

      const result = [];
      for (let i = 2; i <= number; i++) {
        while (isPrime(i) && number % i === 0) {
          if (!result.includes(i)) result.push(i);
          number /= i;
        }
      }
      return result;
    }
| Operator | Number of occurrences | Operand | Number of occurrences |
|---|---|---|---|
| | 2 | | 2 |
| | 2 | | 2 |
| | 2 | | 1 |
| | 3 | | 7 |
| | 2 | | 2 |
| | 4 | | 12 |
| | 3 | | 1 |
| | 2 | | 1 |
| | 2 | | 1 |
| | 2 | | 1 |
| | 2 | | 4 |
| | 3 | | 1 |
| | 1 | | 1 |
| | 1 | | 1 |
| | 1 | | |
| | 1 | | |
| | 1 | | |
| | 1 | | |
| Unique operators: 18 | Total operators: 35 | Unique operands: 14 | Total operands: 37 |
Given that our prime factorization program has 18 unique operators ($n_1$), 14 unique operands ($n_2$), and a total operand count of 37 ($N_2$), we can use Halstead’s difficulty measure to calculate the relative difficulty associated with reading the program with the basic equation:

$$D = \frac{n_1}{2} \times \frac{N_2}{n_2}$$

Substituting in our values, we obtain an overall difficulty score of 23.78.
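The substitution can be sketched in a few lines of JavaScript. The difficulty formula is the one used in the text; the volume and effort formulas are the standard Halstead definitions (volume as program length times the base-2 logarithm of the vocabulary, effort as difficulty times volume) and are an addition here rather than something the chapter spells out. The function name `halsteadMetrics` is hypothetical.

```javascript
// n1/n2: unique operators/operands; N1/N2: total operators/operands.
function halsteadMetrics({ n1, n2, N1, N2 }) {
  const vocabulary = n1 + n2;
  const length = N1 + N2;
  const volume = length * Math.log2(vocabulary); // standard Halstead volume
  const difficulty = (n1 / 2) * (N2 / n2); // the equation used in the text
  const effort = difficulty * volume; // standard Halstead effort
  return { vocabulary, length, volume, difficulty, effort };
}

// Counts taken from the totals of Table 3-1.
const primeFactorsCounts = { n1: 18, n2: 14, N1: 35, N2: 37 };
const metrics = halsteadMetrics(primeFactorsCounts);

console.log(metrics.difficulty);
console.log(metrics.volume);
```

The difficulty comes out to 333/14, a little under 23.79, which the text reports truncated as 23.78.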
Although 23.78 might not signify much on its own, we can gradually acquire an understanding of how this score maps to our experiences, working with individual sections of our code. Over time, through repeated exposure to these values alongside their implementations, we become better able to interpret what a score of 23.78 signifies within the greater context of our application.
Each of the three distinct metrics described in this section can be generated at different scales; they can quantify the complexity of a single function or a complete module. You can calculate the Halstead difficulty metric for an entire file, for instance, by summing up the difficulties of the individual functions contained within it.
Developed by Thomas McCabe in 1976, cyclomatic complexity is a quantitative measure of the number of linearly independent paths through a program’s source code. It is essentially a count of the number of control flow statements within a program. This includes if statements, while and for loops, and case statements inside switch blocks.
Take, for example, a simple program with no control flow components, as shown in Example 3-3. To calculate its cyclomatic complexity, we first assign 1 for the function declaration, incrementing with every decision point we encounter. Example 3-3 has a cyclomatic complexity of 1 because there is only one path through the function.
    function convertToFahrenheit(celsius) {
      return celsius * (9 / 5) + 32;
    }
Let’s look at a more complex example, like our primeFactors function from Example 3-2. In Example 3-4, we reduce it and enumerate each of the control flow points to yield a cyclomatic complexity of 6.
    function primeFactors(number) { // 1
      function isPrime(number) {
        for (let i = 2; i <= Math.sqrt(number); i++) { // 2
          if (number % i === 0) return false; // 3
        }
        return true;
      }

      const result = [];
      for (let i = 2; i <= number; i++) { // 4
        while (isPrime(i) && number % i === 0) { // 5
          if (!result.includes(i)) result.push(i); // 6
          number /= i;
        }
      }
      return result;
    }
Function declaration is the first control flow point.
First for loop is our second point.
First if statement is our third point.
Second for loop is the fourth point.
while is the fifth point.
Second if is the sixth point.
When we’re reading a chunk of code, every time there is a branch (an if statement, a for loop, etc.), we have to begin to reason about multiple states with multiple paths of execution. We have to be able to hold more information in our heads to understand what the code does. So, with a cyclomatic complexity of 6, we can infer that primeFactors is probably not too difficult to read and understand.
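The decision-point count described above can be approximated mechanically. The following is a deliberately naive sketch, not how dedicated complexity tools work: it assumes the source is available as a string, counts branching keywords with a regular expression, and adds 1 for the function itself. The name `approximateCyclomaticComplexity` is hypothetical.

```javascript
// Naive approximation: 1 for the function itself plus 1 per branching keyword.
// It does not parse the code, so keywords inside strings or comments are
// miscounted, and operators such as && or ?: are ignored.
function approximateCyclomaticComplexity(source) {
  const decisionKeywords = /\b(if|for|while|case)\b/g;
  const matches = source.match(decisionKeywords) || [];
  return 1 + matches.length;
}

const primeFactorsSource = `
function primeFactors(number) {
  function isPrime(number) {
    for (let i = 2; i <= Math.sqrt(number); i++) {
      if (number % i === 0) return false;
    }
    return true;
  }

  const result = [];
  for (let i = 2; i <= number; i++) {
    while (isPrime(i) && number % i === 0) {
      if (!result.includes(i)) result.push(i);
      number /= i;
    }
  }
  return result;
}`;

console.log(approximateCyclomaticComplexity(primeFactorsSource)); // 6
```

For real codebases, a tool that builds on a proper parser is a safer choice, because a regular expression cannot tell code from comments or string contents.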
Counting the number of decision points in a program is a simplification of McCabe’s proposed method of calculating its complexity. Mathematically, we can calculate the cyclomatic complexity of a structured program by generating a directed graph representing its control flow; each node represents a basic block (i.e., a straight-line code sequence with no branches), with an edge linking two nodes if there is a way to pass from one block to the other. Given this graph, its complexity, M, is defined as in the following equation, where E is the number of edges, N is the number of nodes, and P is the number of connected components, where a connected component is a subgraph in which the nodes are all reachable from one another.

$$M = E - N + 2P$$
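As a worked instance of McCabe's definition (complexity equals edges minus nodes plus twice the number of connected components), the control flow graph reported for primeFactors in this chapter has 13 edges, 11 nodes, and 2 connected components, which gives:

```latex
M = E - N + 2P = 13 - 11 + 2 \cdot 2 = 6
```

This agrees with the six decision points counted by hand for the same function.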
Figure 3-1 shows an example control flow for primeFactors.
Control flow graph for primeFactors: blue nodes indicate nonterminal states, and red nodes indicate terminal states. In this example, we have 13 edges, 11 nodes, and 2 connected components.

NPath complexity was proposed as an alternative to existing complexity metrics in 1988 by Brian Nejmeh. He argued that focusing on acyclic execution paths did not adequately model the relationship between finite subsets of paths and the set of all possible execution paths. We can observe this limitation in the fact that cyclomatic complexity does not consider nesting of control flow structures. A function with three for loops in succession will yield the same metric as one with three nested for loops. Nesting can influence the psychological complexity of the function, and psychological complexity can have a large impact on our ability to maintain software quality.
McCabe’s metric might be easy to calculate, but it fails to distinguish between different kinds of control flow structures, treating if statements identically to while or for loops. Nejmeh asserts that not all control flow structures are equal; some are more difficult to understand and use properly than others. For example, a while loop might be trickier for a developer to reason about than a switch statement. NPath complexity attempts to address this concern. Unfortunately, this makes it a bit more difficult to calculate, even for small programs, because the calculation is recursive and can quickly balloon. We’ll walk through the calculations for a few examples with if statements to get familiar with how it works. If you’d like to gain a better understanding of how to calculate NPath complexity, given a greater range of control flow statements (including nested control flows), I highly recommend reading Nejmeh’s paper.
Control flow metrics can help you determine the number of test cases your code needs. Cyclomatic complexity offers a lower bound, and NPath complexity provides an upper bound. For instance, with primeFactors, cyclomatic complexity indicates that we would want at least six test cases to exercise each of the decision points.
Our base case for NPath complexity is the same as for our previous temperature converter function in Example 3-3; for a simple program with no decision points, the NPath complexity is 1. To illustrate the multiplicative component of the metric, we’ll take a look at a simple function with a few nested if conditions.
Example 3-5 shows a short function that returns the likelihood of receiving a speeding ticket, given a provided speed. Reading through the function, we reach a first if statement, at which point the given speed can either be below 45 km/h or not. There are then two possible paths: if the speed is less than 45 km/h, we enter the code inside the if block; if not, we simply continue. We next need to verify whether the speed is more than 10 km/h over the supplied speed limit, at which point we again have two possible paths through the code. Eventually, we return our calculated risk factor.
A short function with if statements, in which distinct sections are annotated A, B, C, D, E, and F

    function likelihoodOfSpeedingTicket(currentSpeed, limit) {
      let risk = 0; // A
      if (currentSpeed < 45) {
        risk = 1; // B
      } // C
      if (currentSpeed > (limit + 10)) {
        risk = 2; // D
      } // E
      return risk; // F
    }
NPath complexity calculates the number of distinct paths through a function. We can enumerate each of these paths by calling likelihoodOfSpeedingTicket with a range of values, exercising each set of conditions. We’ll walk through one input together, highlighting the path we traverse through the function. All other unique paths are labeled in Table 3-2.
| Input | Path |
|---|---|
| | A, B, D, F |
| | A, B, E, F |
| | A, C, D, F |
| | A, C, E, F |
| Unique paths: 4 | |
Say we call likelihoodOfSpeedingTicket with a currentSpeed of 30 and limit of 0. Our first if statement will evaluate to true, leading us to B. Our second if statement will also evaluate to true, leading us to D. Then we reach our return statement at F. Repeating this pattern for a variety of inputs, we determine that there are four unique paths through the function. Therefore, our NPath score is 4.
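The path enumeration can also be reproduced in code. The sketch below uses a hypothetical instrumented copy of the function, `tracePath`, which records the labeled sections a call passes through instead of returning the risk. The four input pairs are assumptions chosen to exercise both outcomes of each condition; only the first (a speed of 30 with a limit of 0) comes from the walkthrough above.

```javascript
// Instrumented copy of likelihoodOfSpeedingTicket that records which
// labeled sections (A-F) a call passes through instead of returning the risk.
function tracePath(currentSpeed, limit) {
  const path = ['A'];
  path.push(currentSpeed < 45 ? 'B' : 'C'); // first if taken or skipped
  path.push(currentSpeed > limit + 10 ? 'D' : 'E'); // second if taken or skipped
  path.push('F');
  return path.join(', ');
}

// Hypothetical inputs chosen to exercise both outcomes of each condition.
const inputs = [
  [30, 0],
  [30, 100],
  [60, 0],
  [60, 100],
];
const uniquePaths = new Set(inputs.map(([speed, limit]) => tracePath(speed, limit)));

console.log([...uniquePaths]);
console.log(uniquePaths.size); // 4
```

Because the two if statements are sequential and independent, the path count is the product of their outcomes (2 × 2), which is the multiplicative behavior NPath is meant to capture.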
Some easy forms of refactoring won’t have any impact on your control flow graph (CFG) metrics. Some complexity is unavoidable simply due to complicated business logic. You have to make each of these checks and iterations to ensure that your application is doing what it needs to be doing. When the code you want to refactor involves simplifying unnecessarily complicated logic, then NPath or cyclomatic complexity are great options. If not, then I recommend using a different set of metrics. Do be mindful, however, that even if you are detangling some spaghetti code, NPath or cyclomatic complexity should not be your only metrics; you won’t be able to characterize the impact of your refactoring effort holistically and properly with only a single data point.
Unfortunately, control flow graph metrics can be difficult (and sometimes expensive) to calculate, particularly for very large codebases (which are precisely the ones we’re looking to improve). This is where program size comes into play. Although it may not be quite as scientific as Halstead, McCabe, or Nejmeh’s algorithms, combined with other measurements, program size can help us locate likely pain points in our application. If we’re looking for a pragmatic, low-effort approach to quantifying the complexity of our code, then size-based metrics are the way to go.
When measuring code length, we have a few options available to us. Most developers choose to measure only logical lines of code, omitting empty lines and comments entirely. As with our control flow metrics, we can collect this information at a number of resolutions. I’ve found the following few data points to be quite helpful reference points:
Every codebase has the kind of files that look as if you might not reach the end if you started scrolling from the beginning. Measuring the number of lines of code for these would likely accurately capture the psychological overhead required to understand their contents and responsibilities when a developer pops them open in their editor.
For every endless file, there’s an endless function. (More often than not, the endless functions are found in the endless files.) Measuring the length of functions or methods within your application can be a helpful way of approximating their individual complexities.
Depending on how your application is organized, you may want to keep track of the average function or method length per logical unit. In object-oriented codebases, you likely want to keep track of the average length of each method within a class or package. In an imperative codebase, you might measure the average length of each function within a file or larger module. Whatever the greater organizational unit, knowing the average length of the smaller logical components contained within it can give you an indication of the relative complexity of that unit as a whole.
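These size measurements are simple enough to sketch directly. The helpers below are hypothetical and make a simplifying assumption: a "logical" line is any line that is neither blank nor a single-line `//` comment, so block comments are not handled. Dedicated line-counting tools are more thorough.

```javascript
// Counts lines that are neither blank nor single-line comments.
// Assumption: block comments (/* ... */) are not handled in this sketch.
function countLogicalLines(source) {
  return source
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith('//')).length;
}

// Average length across a unit's functions, given each function's source.
function averageFunctionLength(functionSources) {
  if (functionSources.length === 0) return 0;
  const total = functionSources
    .map((source) => countLogicalLines(source))
    .reduce((sum, count) => sum + count, 0);
  return total / functionSources.length;
}

const addSource = [
  '// Adds two numbers.',
  'function add(x, y) {',
  '',
  '  return x + y;',
  '}',
].join('\n');
const convertSource = [
  'function convertToFahrenheit(celsius) {',
  '  return celsius * (9 / 5) + 32;',
  '}',
].join('\n');

console.log(countLogicalLines(addSource)); // 3
console.log(averageFunctionLength([addSource, convertSource])); // 3
```

The same counting function can be applied per file, per function, or averaged per class or module, matching the three resolutions described above.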
Lines of code (LOC) might vary wildly, depending on the language of a program or programming style, but if we’re comparing apples to apples, we shouldn’t be too concerned. When refactoring at scale, we’re generally concerned with improving code within a single, large codebase. In my experience, the vast majority of developers working with these codebases have invested in establishing style guides, defining a set of best practices, and often enforcing these rules with autoformatters. Some variation is inevitable across teams and components, but broadly speaking, the application as a whole tends to look similar enough that two sets of LOC metrics from distinct sections of the codebase should still be comparable.
When we’re developing new features, there are a few testing philosophies we can adopt. We can opt for a test-driven development (TDD) approach, writing a thorough suite of tests first and then iterating on an implementation until the tests pass; we can write our solution first, followed by the corresponding tests; or we can decide to alternate between the two, incrementally building an implementation, pausing to write a handful of tests with each iteration. Whatever our approach, the desired outcome is the same: a new feature, fully backed by a quality set of tests.
Refactoring is a different beast. When we’re working to improve an existing implementation, whatever the extent of our endeavor, we want to be sure that we’re correctly retaining its behavior. We can safely assert that our new solution continues to work identically to the old by relying on the original implementation’s test suite. Because we are relying on the test coverage to warn us about potential regressions, we need to verify two things before beginning our refactoring effort: first, confirm that the original implementation has test coverage and, second, determine whether that test coverage is adequate.
Say we want to refactor our primeFactors function in Example 3-2. Before we consider making any changes, we need to measure whether it has test coverage and, if it does, whether that test coverage is sufficient. Verifying that the implementation has test coverage is easy. We can just pop open the corresponding test file and take a peek at what it contains. For our example, we find just one test, shown in Example 3-6.
    describe('base cases', () => {
      test('0', () => {
        expect(primeFactors(0)).toStrictEqual([]);
      });
    });
Determining whether that test coverage is adequate, however, is a trickier task. We can evaluate it in two ways: quantitatively and qualitatively. Quantitatively, we can calculate a percentage representing the proportion of code that is executed when the test suite is run against it. We can collect metrics for both the number of functional lines of code and the number of execution paths tested by our simple unit test, yielding 40 percent and 35.71 percent, respectively. Example 3-7 shows the test output generated with the Jest unit testing framework.
Jest test coverage output for primeFactors given a single test case

    -----------------|---------|----------|---------|---------|-------------------
    File             | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
    -----------------|---------|----------|---------|---------|-------------------
    All files        |   35.71 |        0 |      50 |      40 |
     primeFactors.js |   35.71 |        0 |      50 |      40 | 3-6,11-13
    -----------------|---------|----------|---------|---------|-------------------
    Test Suites: 1 passed, 1 total
    Tests:       1 passed, 1 total
Now, we have to decide whether this is adequate test coverage. Neither metric fills me with great confidence that primeFactors is particularly well-tested; after all, this indicates that roughly three-fifths of the function is not being exercised by our current suite. Test coverage is primarily useful in two ways:
Helping us identify untested paths in our program
Serving as a ballpark measure of whether we have tested enough
If you are looking for strategies for testing legacy software, I recommend picking up a copy of Working Effectively with Legacy Code by Michael Feathers. He discusses a bevy of options for how to introduce unit tests retroactively by capitalizing on seams in the code, strategic places where you can change the behavior of your program without modifying the code itself.
To improve the test coverage for our example, we can add one more test case, as shown in Example 3-8. If we recalculate our coverage (see Example 3-9), we notice that with just one additional test case, we can achieve near-perfect coverage. Does this mean that our test coverage is adequate? Quantitatively it might appear to be sufficient; qualitatively it might not be. Peeking back at our implementation for primeFactors, we can easily identify a few missing test cases, such as providing a negative number, or the number 2.
// Base case: zero has no prime factors.
describe('base cases', () => {
  test('0', () => {
    expect(primeFactors(0)).toStrictEqual([]);
  });
});

// A small composite number; the argument matches the test name.
describe('small non-prime numbers', () => {
  test('20', () => {
    expect(primeFactors(20)).toStrictEqual([2, 5]);
  });
});
Example 3-9. Jest test coverage output for primeFactors given two test cases

-----------------|---------|----------|---------|---------|-------------------
File             | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
-----------------|---------|----------|---------|---------|-------------------
All files        |     100 |    83.33 |     100 |     100 |
 primeFactors.js |     100 |    83.33 |     100 |     100 | 12
-----------------|---------|----------|---------|---------|-------------------
Test Suites: 1 passed, 1 total
Tests:       2 passed, 2 total
In my experience, thoughtfully written code generally has between 80 and 90 percent test coverage. This shows that the majority of the code is tested. Be forewarned, however, that test coverage alone is not an indication of how well-tested something is. It’s easy to write low-quality unit tests to reach perfect or near-perfect test coverage. If high test coverage is incentivized by management, you will typically find that a significant portion of your unit tests make little effort to assert the corresponding code’s important behavior.
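To make that warning concrete, here is a minimal sketch that is not taken from the chapter's examples. The names `applyDiscount`, `weakTest`, and `meaningfulTest` are hypothetical, and the function contains a deliberate bug. Both checks execute every line of `applyDiscount`, so a coverage tool would report the same percentage for each, but only the second one asserts anything about the result.

```javascript
// Hypothetical function with a deliberate bug: it adds the discount
// instead of subtracting it.
function applyDiscount(price, percent) {
  const discount = price * (percent / 100);
  return price + discount; // bug: should be price - discount
}

// Executes every line of applyDiscount but asserts nothing, so it can
// never fail. A coverage report still counts those lines as covered.
function weakTest() {
  applyDiscount(100, 10);
  return true;
}

// Executes the same lines and also checks the behavior we care about.
function meaningfulTest() {
  return applyDiscount(100, 10) === 90;
}

console.log('weak test passes:', weakTest());             // true
console.log('meaningful test passes:', meaningfulTest()); // false: the bug is caught
```

The coverage number is identical in both cases; the difference only shows up when you read what each test asserts.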
From a qualitative standpoint, determining whether test coverage is sufficient is not so simple. There is a great deal of thoughtful writing about this already, most of which goes beyond the scope of this book, but at a high level, I think suitable test quality has been attained if the following points hold true:
The tests are reliable. From one run to the next, they consistently produce passing results when run against unchanged code and catch bugs during development.
The tests are resilient. They are not so tightly coupled to implementation that they stifle change.
A range of test types exercise the code. Having unit, integration, and end-to-end tests can help us assert that our code is functioning as intended with different levels of fidelity.
If we have asserted that the test coverage and test quality are substantial enough, then we should be confident in moving forward with our refactoring effort. If tests are lacking in either coverage or quality, we need to spend the requisite time writing more, and better, tests up front. Measuring the test quantity and quality of each of the sections of code we intend to refactor is an important step in helping us determine how much additional work we need to commit to before we begin refactoring.
Before we start refactoring something, we should take stock of any existing documentation about it. Reading through the documentation may help us gain valuable, additional context on the code. While documentation is not a great source of numerical metrics we can use to measure our starting state, it is a critical source of evidence we can use to describe the current problems we seek to improve. We’ll discuss two forms of documentation we should be concerned about when trying to understand and quantify our starting point in anticipation of a large refactoring effort. These are formal and informal forms of documentation.
Formal documentation is everything you most likely think of as documentation. This doesn’t have to follow any official, industry-level standard (like Unified Modeling Language [UML]). Rather, what makes it formal is that it was deliberately authored (and, in many cases, is actively maintained) to inform the reader about your system. Technical specs, architecture diagrams, style guides, onboarding materials, and postmortems are a few examples of formal documentation.
We can use things like technical specs as evidence that our refactor is necessary or useful by referencing design decisions, assumptions, or other designs considered or rejected. Say, for instance, you work on a subsection of your application responsible for processing all user-related actions within your product. The current implementation requires developers writing new features to remember and enumerate every kind of event that needs to be fired and propagated to sibling subsystems when a user modifies their profile. If your team has a history of writing technical design specs for each of their features, you can locate the original specification document for event propagation. This document describes the current implementation, its limitations, and any alternative approaches.
The limitations section states that while it might be convenient to trigger each required event individually at every location, if the team introduces a substantial number of new events, it might become clumsy and burdensome. Today, your system is experiencing that exact problem. It handles more than a dozen event types and your team is struggling to keep track of the sprawl. With every new feature, your team fears forgetting to trigger a critical event type and potentially introducing a pesky bug. You’ve done your best to assert the desired behavior with tests but decide that refactoring how these events are handled is the best solution to taming the chaos of repetitive logic.
Technical specs can be very helpful in supporting your hypothesis of exactly what needs to be improved and how. Occasionally, these documents outline alternative approaches considered but not ultimately chosen. You may be able to explore one of these suggestions with your refactoring effort.
Maintainers of style guides and onboarding materials can sometimes leave traces of their experiences in the documentation they produce. If they’ve recently made an unexpected discovery about how something works and sought to improve the documentation as a result of that experience, you might be able to catch a glimpse of that in their writing. You might find warnings in large, bolded text of exactly what not to do. It’s also not uncommon to see a disproportionate amount of content devoted to particularly complex pieces of the codebase in these kinds of documents; more people across the company will have devoted more time to trying to steer readers in the right direction, away from the pitfalls they themselves fell into. If the code you want to refactor is documented in these sources and follows these patterns, it might be good evidence that it can be measurably improved. Think about the ideal tone and content of the documentation for your target code and use that as inspiration.
Postmortems can serve as great supporting evidence. If your team follows the PagerDuty incident response process and has been doing so for some time, then you likely have access to dozens of postmortem documents detailing the what, where, when, why, and how of every instance where your application wasn’t behaving as expected.
When building a case for code that is worth refactoring, I search for postmortems summarizing incidents I believe directly involved that code. Then I read through two sections: “Contributing Factors” and “What Didn’t Go So Well?” When I suspect that the complexity of the code had a direct impact on the time to resolution or perhaps even caused the incident in the first place, these two sections will likely confirm it. A count of the number of incidents that list the area you want to refactor as a problem makes a valuable metric.
It’s also important to take note of third-party or publicly facing documentation. While refactoring is not meant to modify the behavior for consumers of your application, this documentation can be particularly useful for bolstering your understanding of the code you’re intending to rewrite.
Alongside our formal documentation, we produce a wide range of informal documentation. These are the kinds of written artifacts that we don’t consider to be proper documentation simply because they don’t typically occur in document form. In my experience, I’ve found more evidence speckled throughout informal sources than in any formal documentation.
Finding these sources is all about thinking outside the box. I’ll enumerate a few here but keep your eyes peeled for other sources around you. You just might surprise yourself!
Chat and email transcripts can provide insightful information about the code you’re seeking to refactor. Best of all, these often provide a good deal of context, both historical and organizational. Say, for instance, you want to refactor how asynchronous jobs are structured in your application. The job queue system currently accepts a dynamic set of arguments of arbitrary size to maximize flexibility for its consumers. Unfortunately, this has led to quite a bit of confusion around its actual limitations, putting the system at risk of running out of memory when processing jobs with extremely large argument payloads, or crashing abruptly when it is unable to parse malformed inputs.
You want to be certain that your experience with the system’s ambiguity is not anecdotal to you and your team. To measure how troublesome writing new jobs is, you search your company’s Slack (or other messaging solution) for a set of keywords that relate to job queue arguments. Unsurprisingly, you come across a number of messages where someone was surprised or concerned that their job didn’t work as intended. Developers across the company are asking whether they should provide raw or opaque IDs. Why one over the other? Do we log these job arguments? If so, do we need to be careful about including personally identifiable information? How much data can we send via these arguments? Are we able to serialize entire objects and supply these to the job queue system?
You create a document that points to each of these messages, with a short description of the context around each. (This should be easy to do with a short backscroll through the conversation.) Now you can reference these instances to demonstrate the difficulty that developers are currently running into.
Chat history gives you the unique ability to peek into conversations that occurred long before your arrival. You might be surprised to see people spread across a variety of engineering teams talking about the problems you’re eager to fix months or years before your first day on the job. You might encounter others asking the same question at a regular cadence. When this happens, not only is it extremely validating to your endeavor, but you may get some valuable allies by reaching out to the folks on those teams and asking them about their experience with the code you want to improve. Quantitatively, you can use these conversations to approximate how many engineering hours are lost to confusion about the code you want to improve and to answering questions about it.
Depending on your engineering team’s project management tools of choice, you may be able to gather some important metrics related to the code you want to refactor by searching for related bugs in your bug tracking system. You might also be able to estimate the amount of time other teams or individual developers have spent investigating and fixing bugs or implementing changes related to your target code.
Say the code around a particular feature or feature set has been gaining complexity over time. You want to invest effort in tidying it up so that your team can develop at a quicker pace. If you suspect that your team’s velocity has decreased, you can use your project management software to confirm it. Note that this is a very coarse metric (and as with all of our other metrics, only quantifies a single aspect of the overall problem). You will probably need intimate knowledge of how your team organizes its development cycles, and you will need to confidently remove outliers from your data, to tease out a compelling metric here, but for some teams, it can be an indisputable one!
Technical program managers at some companies can be a great resource for helping you collect, filter, and disseminate these kinds of metrics. They are often whizzes at navigating project management tools and locating hard-to-find documents. Who knows, you might even make a new friend!
At this point, this may all sound like an excessive amount of investigatory work to quantify a given problem. That’s okay! It’s up to you to decide which metrics will have the most impact in communicating the severity of the problem and the potential benefit of fixing it. You may not want or need to spend the time digging through hundreds of tasks or postmortems, but if this information is easy to consume and search, it might be worthwhile. These metrics can especially come in handy when trying to convince management and leadership teams that are highly removed from the code that refactoring is worthwhile.
We primarily think of version control as a tool to manage changes to our applications. We use it to move forward incrementally, allowing for the development of multiple features at once, and progressive shipment of those features. Sometimes, we use it to refer to previous versions of our code to track down a bug or locate someone who might know about the section of code we’re reading. We rarely think of version control as a source of information about our team’s development patterns when analyzed in aggregate. Turns out, we can glean quite a bit about the problems our engineering team is facing when we take a look at our commits from a different perspective.
Although not everyone makes writing descriptive commit messages part of their development method, if you work on a team where a majority of developers do, these short descriptions can provide a glimpse into the issues that they might be running into. We can identify patterns either by searching for a set of keywords or by isolating commit messages associated with changes to a set of files we’re interested in.
Let’s say we’re looking at our job queue system problem from earlier. We know that engineers regularly forget to sanitize their arguments before enqueueing jobs, resulting in logging personally identifiable information (PII). We can search through our commit messages and identify commits where the corresponding messages include words like “job,” “job handler,” or “PII.” From this result set, we might find a substantial set of commits that either introduced a new job responsible for leaking PII or fixed a job already leaking it. Alternatively, if our job handlers are conveniently organized into distinct files, we could narrow our search to include only commits with modifications to these files and comb through the derived set for similar patterns.
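The keyword search described above can be sketched in a few lines. This is only an illustration under stated assumptions: it expects text in the shape produced by `git log --pretty=format:"%h%x09%s"` (abbreviated hash, a tab, then the commit subject), and the helper name `findCommitsMentioning`, the hashes, and the messages are all invented.

```javascript
// Sample text in the shape produced by:
//   git log --pretty=format:"%h%x09%s"
// The hashes and messages below are invented for illustration.
const sampleLog = [
  'a1b2c3d\tAdd welcome email job handler',
  'b2c3d4e\tFix typo in README',
  'c3d4e5f\tStop logging PII in export job arguments',
  'd4e5f6a\tBump dependency versions',
].join('\n');

// Returns the commits whose subject mentions any of the keywords,
// ignoring case. Matching is by plain substring, so "job" also
// matches "jobs".
function findCommitsMentioning(logText, keywords) {
  const needles = keywords.map((keyword) => keyword.toLowerCase());
  return logText
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => {
      const [hash, ...rest] = line.split('\t');
      return { hash, subject: rest.join('\t') };
    })
    .filter(({ subject }) => {
      const haystack = subject.toLowerCase();
      return needles.some((needle) => haystack.includes(needle));
    });
}

const matches = findCommitsMentioning(sampleLog, ['job', 'job handler', 'PII']);
console.log(matches.map((commit) => commit.hash)); // [ 'a1b2c3d', 'c3d4e5f' ]
```

Substring matching is deliberately crude; the resulting set still needs to be read by a person to separate commits that introduced a leak from commits that fixed one.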
Some development teams relate their commits or changesets to their project management tools by highlighting bug or ticket numbers in the commit message or branch name. If this information is available to us, we can link the changeset back to our previous collection of metrics on development velocity and bug count. It all comes full circle!
In his book, Software Design X-Rays, Adam Tornhill proposes a set of techniques for teasing out important development patterns from version history. He hypothesizes that these development behaviors can help you identify which sections of your application you should prioritize when refactoring, illustrate how the complexity of certain functions has changed over time, and highlight any tightly coupled files or modules. I highly recommend reading his research to comprehend fully the psychology behind why these measurements are so enlightening, but I’ll summarize the basic techniques here so that you might consider them ahead of your next big refactor.
Change frequencies are the number of commits made to each file over the complete version history of your application. You can easily generate these data points by extracting file names from your commit history, aggregating them, and ordering them from most to least frequent. In practice, Tornhill noticed that these frequencies tended to follow a power law distribution, where a disproportionate number of changes occur in a small subset of core files. Knowing the files that are committed to most often tells us exactly which files need to be the easiest to understand and navigate for developers and, therefore, which files we should spend the most effort maintaining, from a developer productivity perspective.
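As a rough sketch of the extract, aggregate, and order steps, the snippet below counts commits per file. It assumes input in the shape produced by `git log --name-only --pretty=format:`, which lists the files touched by each commit with blank lines between commits. The function name `changeFrequencies` and the file names are invented for illustration; this is not Tornhill's tooling.

```javascript
// Sample text in the shape produced by:
//   git log --name-only --pretty=format:
// The file names are invented for illustration.
const sampleNameOnlyLog = `
src/billing/charge.js
src/billing/invoice.js

src/billing/charge.js

src/users/profile.js
src/billing/charge.js

src/billing/invoice.js
`;

// Counts how many commits touched each file and sorts the result from
// most to least frequently changed (ties broken by file name).
function changeFrequencies(nameOnlyLog) {
  const counts = new Map();
  for (const line of nameOnlyLog.split('\n')) {
    const file = line.trim();
    if (file === '') continue; // blank lines separate commits
    counts.set(file, (counts.get(file) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([file, commits]) => ({ file, commits }))
    .sort((a, b) => b.commits - a.commits || a.file.localeCompare(b.file));
}

console.log(changeFrequencies(sampleNameOnlyLog));
```

For this sample, `src/billing/charge.js` comes first with three commits, followed by `src/billing/invoice.js` with two and `src/users/profile.js` with one.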
We can apply the same concept of change frequencies to functions as well. By looking at individual commits, we can carefully attribute changes to the respective functions within individual files, producing total frequency numbers for each of them. By combining this data with one of our earlier complexity metrics, lines of code, we can map complexity changes over time across the entire codebase. This information shows us potential hotspots ripe for improvement. We can later regenerate these metrics once we’ve completed our refactor to confirm that not only has the complexity of these hotspots decreased, but, ideally, their change frequency has as well.
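One simple way to combine the two measurements is to flag anything that is both changed often and large. The sketch below is an assumption-laden illustration rather than Tornhill's method: the function names, commit counts, line counts, and thresholds are all invented, and `findHotspots` is a hypothetical helper.

```javascript
// Hypothetical per-function data: how many commits touched each function
// and how many lines of code it currently has. All values are invented.
const functionStats = [
  { name: 'processOrder', commits: 42, linesOfCode: 310 },
  { name: 'formatReceipt', commits: 40, linesOfCode: 25 },
  { name: 'legacyTaxTable', commits: 2, linesOfCode: 900 },
  { name: 'roundCurrency', commits: 3, linesOfCode: 12 },
];

// Flags functions that are both changed often and large. The thresholds
// are arbitrary assumptions; tune them to your own codebase.
function findHotspots(stats, { minCommits, minLinesOfCode }) {
  return stats
    .filter((s) => s.commits >= minCommits && s.linesOfCode >= minLinesOfCode)
    .sort((a, b) => b.commits - a.commits)
    .map((s) => s.name);
}

console.log(findHotspots(functionStats, { minCommits: 20, minLinesOfCode: 100 }));
// [ 'processOrder' ]
```

A small function that changes constantly and a huge function that never changes both drop out; only code that is large and frequently touched is flagged. Rerunning the same calculation after the refactor gives a before-and-after comparison.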
Tornhill also describes a method for pinpointing tightly coupled modules in your program by looking at sets of files modified within the same commit. To depict this idea, let’s say we have three files, superheroes.js, supervillains.js, and sidekicks.js. In a subset of our commits, we have the following changes: commit one modifies both superheroes.js and sidekicks.js; commit two modifies all three files; commit three again modifies superheroes.js and sidekicks.js; and commit four only touches superheroes.js. From this subset of our version history, depicted in Table 3-3, we notice that of four commits, three of them modified both superheroes.js and sidekicks.js. This suggests that some kind of coupling between these two files exists. Certainly not all coupling is bad (as is the case for changes in source code and the corresponding unit test files), but in some cases these patterns can indicate an erroneous abstraction, copy-pasted code, or sometimes both. Once we’ve pinpointed these problems, we can work to fix them and then rerun the analysis sometime later to confirm that they no longer exist.
| Commit # | superheroes.js | supervillains.js | sidekicks.js |
|---|---|---|---|
| 1 | x | | x |
| 2 | x | x | x |
| 3 | x | | x |
| 4 | x | | |
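The co-change counting behind this table can be sketched as follows. The commit data mirrors the four commits described above; the helper name `coChangeCounts` is hypothetical, and this is a simplified illustration rather than Tornhill's actual tooling.

```javascript
// The four commits from the example, each listed as the files it modified.
const commits = [
  ['superheroes.js', 'sidekicks.js'],
  ['superheroes.js', 'supervillains.js', 'sidekicks.js'],
  ['superheroes.js', 'sidekicks.js'],
  ['superheroes.js'],
];

// Counts, for every pair of files, the number of commits that modified both.
function coChangeCounts(commitFileLists) {
  const counts = new Map();
  for (const files of commitFileLists) {
    // Sort so that each pair always gets the same key regardless of order.
    const unique = [...new Set(files)].sort();
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const pair = `${unique[i]} + ${unique[j]}`;
        counts.set(pair, (counts.get(pair) || 0) + 1);
      }
    }
  }
  return counts;
}

const coupling = coChangeCounts(commits);
for (const [pair, count] of coupling) {
  console.log(`${pair}: ${count} of ${commits.length} commits`);
}
// sidekicks.js + superheroes.js: 3 of 4 commits
// sidekicks.js + supervillains.js: 1 of 4 commits
// superheroes.js + supervillains.js: 1 of 4 commits
```

The pair that changes together in three of four commits stands out immediately, which is the signal worth investigating.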
As with each of our quantitative metrics in this chapter, there are some caveats to this kind of measurement. Different developers have different practices around committing changes. Some programmers will make a large quantity of tiny commits; others will make large commits, including dozens of changes across multiple files, into a single changeset. Moreover, it’s entirely likely this analysis will reveal some outliers (configuration files frequently changed or hotspots in autogenerated code). We have to be vigilant about these anomalies when poring over the data to mitigate the risk of finding problems where there might not be any.
Whether we’re aware of it or not, each of the many sections of our software systems has a distinct reputation. Some reputations are stronger than others; some are positive, some are deeply negative. Whatever the reputation, however, it is slowly built up over time, spreading across the engineering organization as more and more engineers interact with the code. Word of the most disastrous codebases sometimes even travels outside of your company and into the wider industry, discussed over dinner among friends and on internet forums. Whether these reputations continue to hold true or not, they can tell us plenty about some of the most troublesome pieces of our applications and just how desperately they need our attention.
A simple, low-effort means of collecting reputation data is to interview fellow developers. Let’s assume you work on an application that charges customers for a monthly service and you want to improve your application’s billing code. You set up some interviews with developers who fall into two categories: those who work directly with the billing code on a regular basis, and those who have worked with it on occasion. For each of these two sets, you’ll want to speak to developers who have a range of tenures on their current team and within the company; the experiences of those who have worked integrally with the billing code for years are probably pretty different from those of an engineer who was hired six months ago.
We then derive a set of questions that will help us characterize their experience. We begin with a few questions to frame their background and then delve into their thoughts about the code. A few are suggested in Table 3-4 to get you started.
Given your experience with the billing code, when you were evaluating which files could benefit the most from a thorough refactor, you immediately thought of chargeCustomerCard.js. You decide to ask your interviewees about the file to see what sort of reaction it elicits. If the second you mention chargeCustomerCard.js, your interviewee grimaces, whether they have intimate knowledge of the inner workings of that file or not, that’s a strong indication that the file could probably use a little bit of love.
If we want to solicit feedback from a larger group of engineers or are tight for time on establishing our starting metrics, we can rephrase our interview questions to fit a standard set of answers. This will make aggregating the responses easier and allow us to derive conclusions from them faster. Be warned, however, that by reducing your fellow developers’ thoughts to a set of scores, you’ll be stripping away some of the nuance that you might have been able to glean from an in-person (or virtual) interview.
From experience, interviews tend to give you more flexibility to explore ideas and topics that bubble up candidly. It’s often the back and forth banter that brings out the best aha! moments. If we sent around a developer survey with long-form interview-like questions, not only would we not be able to ask the respondents in real time to provide more details about their answers, but we would likely get fewer responses. I’m very guilty of opening up a survey, noticing that it is a series of half-a-dozen open-ended questions, and almost immediately setting myself a reminder to do it later. If you want to solicit feedback from engineers in survey form, keep it short; this way, you have a better chance of getting a high response rate.
| Interview question | Survey question | Notes |
|---|---|---|
| How long have you been working with X code? | Select the option that best describes the amount of time you have spent working with X code: less than 6 months; 6 months to 1 year; more than one year. | Note that in the survey question version, you should choose time ranges that make the most sense for your engineering organization. At high-growth, younger companies, the ranges are probably on the order of months; at larger, more established companies, the ranges could be on the order of years. |
| If you could change one thing about working with X code, what would it be? Why? | If you could choose only one of the listed options to improve your experience working with X code, which one would it be? | For the survey question, choose some options that you think would make the most impact and optionally provide a write-in field. If the code doesn’t have any tests, add an option that states that the code is fully tested. If a large proportion of the code is contained within a few functions that are hundreds of lines long, add an option that states that the code is split up into small, modular functions. |
| Tell me about a bug you recently had to fix that involved X code. What would have made it easier to solve? | Of the Y options listed below, what about X code makes it the most difficult to fix bugs efficiently? | |
| Have you strategically avoided working in X code before (i.e., fixing a bug at a level above or below the problem area)? Tell me about that experience. | On a scale from 1 to 5, 1 being not likely at all and 5 being very likely, how likely are you to find a way to avoid making changes to X code? | |
| How does the complexity of X code hinder your ability to develop new features? | With 1 being strongly disagree and 5 being strongly agree, rate the following statement: The complexity of X code is a significant contributor to the time it takes for me to develop new features. | |
| How does the complexity of X code hinder your ability to test and/or debug your code? | With 1 being strongly disagree and 5 being strongly agree, rate the following statement: The complexity of X code is a significant contributor to the difficulty of testing and/or debugging my code. | |
| How does the complexity of X code hinder your ability to review other developers’ changes to the code? | With 1 being strongly disagree and 5 being strongly agree, rate the following statement: The complexity of X code is a significant contributor to the time and difficulty involved for me to review other developers’ changes to the code. | |
Reputation can also hinder a team’s ability to hire and retain engineers. Say the billing code is known to be particularly treacherous at your company. While the team probably has a handful of developers who are committed to their roles, working in a frustratingly complex codebase can take a toll on morale. Organizations don’t like to admit that they’ve lost engineers due to code quality and development practices, but it happens all the time. If you’re able to collect information on engineers’ reasons for leaving the team and tie those back to code complexity, it can be an incredibly compelling metric for dedicating some much-needed resources to refactoring.
Now that we’ve familiarized ourselves with a wide range of potential metrics, we have to choose which ones to use. To build the most comprehensive view of the current state of the world, you must identify the metrics that best illustrate the specific problems you want to address. None of these metrics alone can quantify the many unique aspects of a large refactoring effort, but combined, they let you build a multifaceted characterization of the problem.
I recommend picking one metric from every category. Approximate code complexity in whatever way makes the most sense given the nature of your problem and the tools already available to you. Generate some test coverage metrics to make sure you start off on the right foot. Identify a source of formal documentation you can use to illustrate the problems your refactor aims to solve; back it up with some informal documentation as well. Gather information about your hotspots and programming patterns by slicing and dicing version control data. Last, consider the code’s reputation by chatting with your colleagues.
If you find that most of these metrics can help you quantify the current state of the code you are aiming to refactor and the impact it has on your organization, consider choosing the subset that has the greatest chance of showing significant improvements. These are the metrics that will make the most compelling case to your teammates and, ultimately, management. In the end, you’ll have to make a convincing argument to those you report to that the time and energy you and your teammates are ready to devote to the refactor will pay off.
We’ve successfully gathered evidence to help us properly characterize the problem we’re experiencing, but setting the stage is only one piece of the puzzle. Next, we have to use the data we’ve collected to assemble a concrete execution plan.
One day, I plan to complete the 4,500-kilometer drive between Montreal and Vancouver. The drive takes about 48 hours from start to finish, with the fastest route covering most of the length of the border between Canada and the United States. The fastest route isn’t necessarily the most rewarding route, however, and if I add stops to see Parliament Hill in Ottawa, the iconic CN Tower in Toronto, and the Sleeping Giant Provincial Park, I lengthen my trip by a few hours and about 600 kilometers.
Now, anyone setting out on this journey knows that driving it nonstop from start to finish is both impractical and dangerous. So, before I head out, I should map out a rough outline for the roadtrip. I should figure out how much time I’m comfortable driving on the road-heavy days, and which cities I might want to pop in to do some sightseeing. In total, I estimate the trip might take between seven and 10 days depending on how long I spend sightseeing. The flexibility allows for a few unexpected twists, whether I decide to sightsee an extra day or get stranded on the side of the road and need to call for assistance.
How do you know whether you’ve had a successful roadtrip beyond actually reaching your final destination? If you set a budget for your trip, you might have achieved your goal if your next credit card bill falls within range. Maybe you wanted to eat a burger at every stop along the way. Probably, you just wanted to see something new, spend some quality time with friends or family, and make a few new memories. As tacky as it might sound, the roadtrip is just as much about the journey as it is about the destination.
Any large software endeavor can look quite a bit like a roadtrip across the country. As developers, we decide on a set of milestones we want to accomplish, a rough set of tasks we want to complete in between each of these milestones, and an estimate for when we think we might reach our destination. We keep track of our progress along the way, ensuring that we stay on task and within the time we’ve allotted ourselves. By the end, we want to see a measurable, positive impact, achieved in a sustainable way.
We’ve taken the time to understand our code’s past, first by identifying how our code has degraded, then by characterizing that degradation. Now, we’re ready to map out its future. We’ll learn how to split up a large refactoring effort into its most important pieces, crafting a plan that is both thorough and precise in scope. We’ll highlight when and how to reference the metrics we carefully gathered in Chapter 3 to characterize the current problem state. We’ll discuss the importance of shopping your plan around to other teams and wrap things up by emphasizing the value in continuously updating it throughout the whole process.
Everyone takes a different approach to building out an execution plan. Whether your team calls them technical specs, product briefs, or requests for comments (RFCs), they all serve the same purpose: documenting what you intend to do and how you intend to do it. Having a clear, concise plan is key to ensuring the success of any software project, regardless of whether it involves refactoring or building out a new feature; it keeps everyone focused on the important tasks at hand and enforces accountability for their progress throughout the endeavor.
Our first step is to define our end state. We should already have a strong understanding of where we currently are; we spent considerable time in Chapter 3 measuring and defining the problem we want to solve. Now that we’ve grounded ourselves, we need to identify where we want to land.
We’re kicking off our roadtrip in Montreal, where we currently live. Of the hundreds of towns and cities speckled along the west coast, we have to pick just one to aim for. So, after a bit of research, we decide to aim for Vancouver.
Next, we need to familiarize ourselves with the highways leading directly into the city and decide where we might want to stay upon arrival. We reach out to friends who’ve either lived in Vancouver or who travel there frequently for recommendations. We land on Yaletown, a neighborhood known for its old warehouse buildings by the water. Now that our trip has a well-defined destination, we can start figuring out precisely how to get there.
To illustrate the many important concepts in this chapter, we’ll be using an example of a large-scale refactor at a 15-year-old biotechnology company we’ll call Smart DNA, Inc. Most of its employees are research scientists, contributing to a complex data pipeline comprising hundreds of Python scripts across a few repositories. The scripts are deployed to and executed in five distinct environments. All of these environments rely on a version of Python 2.6. Unfortunately, Python 2.6 has long since been deprecated, leaving the company susceptible to security vulnerabilities and preventing it from updating important dependencies. While relying on outdated software is inconvenient, the company has not prioritized upgrading to a newer Python version. It’s a massive, risky undertaking, given the very limited testing in place. Simply put, this has been the biggest piece of technical debt at the company for many years.
The research team has recently grown concerned about its inability to use newer versions of core libraries. Given that the upgrade is now important to the business, we’ve been tasked with figuring out how to migrate each of the repositories and environments to use Python 2.7.
The research team manages its dependencies by using pip. Each repository has its own list of dependencies, encoded in a requirements.txt. Having these distinct requirements.txt files has made it difficult for the team to remember which dependencies are installed on a given project when switching between projects. It also would require the software team to audit each file and upgrade it to be compatible with Python 2.7 independently. As a result, the software team decided that although it was not necessary, it would make the Python 2.7 upgrade easier for them (and simplify the researchers’ development process) to unify the repositories and thus unify the dependencies.
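To give a rough sense of what unifying the dependency lists involves, the sketch below merges the contents of several requirements.txt files and flags any package pinned to more than one version. The repository names, the package pins, and the `mergeRequirements` helper are all hypothetical, and the parser only understands simple `name==version` lines.

```javascript
// Hypothetical contents of each repository's requirements.txt (illustrative pins only).
const requirementsByRepo = {
  "pipeline-core": "numpy==1.6.2\nscipy==0.11.0\n# plotting\nmatplotlib==1.1.0\n",
  "sequencing-tools": "numpy==1.5.1\nbiopython==1.60\n",
  "reporting": "matplotlib==1.1.0\n",
};

// Merge every list and record each version a package is pinned to.
function mergeRequirements(filesByRepo) {
  const versionsByPackage = new Map(); // package name -> Set of pinned versions
  for (const contents of Object.values(filesByRepo)) {
    for (const rawLine of contents.split("\n")) {
      const line = rawLine.trim();
      if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
      const [name, version = "unpinned"] = line.split("==");
      const key = name.toLowerCase();
      if (!versionsByPackage.has(key)) versionsByPackage.set(key, new Set());
      versionsByPackage.get(key).add(version);
    }
  }

  const packages = [...versionsByPackage.keys()].sort();
  // A package pinned to more than one version needs a human decision.
  const conflicts = packages
    .filter((name) => versionsByPackage.get(name).size > 1)
    .map((name) => ({ name, versions: [...versionsByPackage.get(name)].sort() }));
  return { packages, conflicts };
}

const merged = mergeRequirements(requirementsByRepo);
console.log(merged.packages);  // every package used anywhere
console.log(merged.conflicts); // packages whose pins disagree between repositories
```

A report like this does not decide which version to keep; it only shows where the audit needs attention before a single list can be written.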
Our execution plan should clearly outline all starting metrics and target end metrics, with an optional, albeit helpful, additional column to record the actual, observed end state. For the Python migration, the starting set of metrics was clear: each repository had a distinct list of dependencies, with each environment running Python 2.6. The desired set of metrics was equally simple: have each of the business’s environments running Python 2.7, with a clear, succinct set of required libraries managed in a single place. Table 4-1 shows an example where we’ve listed Smart DNA’s metrics.
| Metric description | Start | Target | Observed |
|---|---|---|---|
| Environment 1 | Python 2.6.5 | Python 2.7.1 | - |
| Environment 2 | Python 2.6.1 | Python 2.7.1 | - |
| Environment 3 | Python 2.6.5 | Python 2.7.1 | - |
| Environment 4 | Python 2.6.6 | Python 2.7.1 | - |
| Environment 5 | Python 2.6.6 | Python 2.7.1 | - |
| Number of distinct lists of dependencies | 3 | 1 | - |
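If you prefer to track these metrics somewhere a script can read them, the rows of Table 4-1 translate directly into a small data structure. This is only a sketch: the field names and the `outstandingMetrics` helper are assumptions made for illustration, and the update to Environment 1 is a made-up observation.

```javascript
// Rows mirror Table 4-1; `observed` stays null until the end state is measured.
const metrics = [
  { description: "Environment 1", start: "Python 2.6.5", target: "Python 2.7.1", observed: null },
  { description: "Environment 2", start: "Python 2.6.1", target: "Python 2.7.1", observed: null },
  { description: "Environment 3", start: "Python 2.6.5", target: "Python 2.7.1", observed: null },
  { description: "Environment 4", start: "Python 2.6.6", target: "Python 2.7.1", observed: null },
  { description: "Environment 5", start: "Python 2.6.6", target: "Python 2.7.1", observed: null },
  { description: "Number of distinct lists of dependencies", start: 3, target: 1, observed: null },
];

// Returns the descriptions of metrics whose observed value does not yet match the target.
function outstandingMetrics(rows) {
  return rows
    .filter((row) => row.observed !== row.target)
    .map((row) => row.description);
}

// Hypothetical update: suppose Environment 1 has been migrated and measured.
metrics[0].observed = "Python 2.7.1";
console.log(outstandingMetrics(metrics)); // the five metrics still short of their target
```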
Feel free to provide both an ideal end state and an acceptable end state. Sometimes, getting 80 percent of the way there gives you 99 percent of the benefit of the refactor, and the additional amount of work required to get to 100 percent simply isn’t worthwhile.
Next, we want to map the most direct path between our start and end states. This should give us a good lower-bound estimate on the amount of time required to execute our project. Building on a minimal path ensures that your plan stays true to its course as you introduce intermediate steps along the way.
So, for our roadtrip, we do a quick search to see what the most direct route between Montreal and Vancouver looks like (Figure 4-1). Presuming minimal traffic, it appears to take 47 hours if we were to leave Montreal and drive nonstop westward.
We can determine a more reasonable lower bound for our trip by deciding how many hours we’re comfortable driving per day and dividing the approximate 47 hours of driving by that number. If we commit to eight hours of driving per day, it’ll take us just about six days.
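The arithmetic behind that lower bound is simple enough to write down. The function name below is made up for illustration; the same calculation applies to a software plan if you swap driving hours for estimated effort and hours per day for the team's capacity.

```javascript
// Lower bound on trip length: total driving hours divided by hours driven per day,
// rounded up because a partial day of driving still takes up a day.
function minimumDays(totalDrivingHours, hoursPerDay) {
  return Math.ceil(totalDrivingHours / hoursPerDay);
}

console.log(minimumDays(47, 8)); // 6 (47 / 8 = 5.875, rounded up)
```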
Now that we’ve mapped the shortest possible path between the two points, we can start to pick out any major complications or overarching strategies we want to change. One peculiarity of the direct route is that the vast majority of it travels across the United States, not Canada. If we want to restrict our drive to the area north of the 49th parallel, we’d be adding an extra hour or two to the trip. However, because it does reduce the overall complexity of the trip (no need to carry our passport or worry about time wasted at a border crossing), we’ll opt to stay in Canada (Figure 4-2).
Unfortunately, Google Maps for software projects doesn’t exist quite yet. So how do we determine the shortest path from now to project completion? We can do this in a couple of ways:
Open a blank document and for 15 to 20 minutes (or until you’ve run out of ideas), write down every step you can come up with. Set the document aside for at the very least a few hours (ideally a day or two), then open it up again and try to order each step in chronological order. As you begin to order the steps, continue to ask yourself whether each is absolutely required to reach the final goal. If not, remove it. Once you have an ordered set of steps, reread the procedure. Fill in any glaring gaps as they arise. Don’t worry if any steps are terribly ill-defined; the goal is only to produce the minimum set of steps required to complete your project. This won’t be the final product.
Gather a few coworkers who are either interested in the project or you know will be contributing. Set aside an hour or so. Grab a pack of sticky notes and a pen for each of you. For 15 to 20 minutes (or until everyone’s pens are down), write down every step you think is required, each on individual sticky notes. Then, have a first person lay out their steps in chronological order. Subsequent teammates go through each of their own sticky notes and either pair them up with their duplicates or insert them into the appropriate spot within the timeline. Once everyone’s organized all of their notes, go through each step and ask the room whether they believe that the step is absolutely required in order to reach the goal. If not, discard it. The final product should be a reasonable set of minimal steps. (You can easily adapt this method for distributed teams by combining all individually brainstormed steps into a jointly shared document. Either way, the final output of the exercise should be a written document that is easy to distribute and collaboratively improve.)
If neither of these options works for you, that’s all right! Use whatever method you find most effective. As long as you are able to produce a list of steps you believe model a direct path to achieving your goal, no matter how ill-defined they might be, you’ve successfully completed this critical step.
The team at Smart DNA gathered into a conference room for a few hours to brainstorm the steps required to get all services using a newer version of Python. On a whiteboard, they started out by drawing a timeline. On the far left was their starting point and, on the far right, their goal. Teammates alternated listing important steps along the way, slotting them in along the line. A subset of the brainstormed steps is as follows:
- Build a single list of all the packages across each of the repositories manually.
- Narrow the list to just the necessary packages.
- Identify which version each package should be upgraded to in Python 2.7.
- Build a Docker container with all the required packages.
- Test the Docker container on each of the environments.
- Locate tests for each repository; determine which tests are reliable.
- Merge all the repositories into a single repository.
- Choose a linter and corresponding configuration.
- Integrate the linter into continuous integration.
- Use the linter to identify problems in the code (undefined variables, syntax errors, etc.).
- Fix problems the linter identified.
- Install Python 2.7.1 on all environments and test.
- Use Python 2.7 on a subset of low-risk scripts.
- Roll out Python 2.7 to all scripts.
We can see from our subset that some steps can be parallelized or reordered, and others should be broken down into further detail. At this point in the process, our focus is on getting a rough sense of the steps involved; we’ll refine the process throughout the chapter.
We’ll next use the procedure we derived to come up with an ordered list of intermediate milestones. These milestones do not need to be of similar size or evenly distributed, as long as they are achievable within a timescale that feels comfortable. We should focus on finding milestones that are meaningful in and of themselves. That is, either reaching the milestone is a win on its own, or it defines a step we could comfortably stop at if necessary (or both). If you can identify milestones that are both meaningful and showcase the potential impact of your refactoring effort early, then you’re doing great!
For the stretch of the trip between Winnipeg and Vancouver, we ask some friends and family for recommendations of sights to see and things to do. After weighing their suggestions with our own interests, we come up with a rough itinerary, which includes everything from camping to museum visits, tasty pitstops, and a few visits to extended family (Figure 4-3). But at no point do any of these points of interest take us radically off course.
We can apply similar tactics to narrow in on our milestones for our refactoring effort. For each of the steps we brainstormed previously, we can ask ourselves these questions:
Let’s refer back to our previous example, outlined in “At Work”. A logical, feasible milestone might be to combine each of the distinct repositories into a single repository for convenience. The software team at Smart DNA anticipates that it’ll take six weeks to merge the repositories properly, without disrupting the research team’s development process. Because the software team is accustomed to shipping at a quicker pace, and the members are concerned about morale if they set out to merge the repositories too early in the migration, they decide on a simpler initial milestone: generating a single requirements.txt file to encompass all package dependencies for each of the repositories. By taking the time to reduce the set of dependencies early, they are simplifying the development process for the research team, taking a substantial step toward enabling the merging of the repositories, and all of that well before the migration to Python 2.7 is complete.
When choosing major milestones, we should optimize for steps that demonstrate the benefits of the refactor early and often. One way to do that is to focus on steps that, upon completion, deliver immediate value to other engineers. This should increase the morale of both your team and the other engineers affected by your changes.
When scoping out the Python migration, we noticed that none of the repositories used any continuous integration to lint for common problems in the proposed code changes. We know that linting the existing code could help us pinpoint problems we risk encountering when executing it in Python 2.7. We also know that enabling a simple, automatic linting step could promote better programming practices for the entire research team for years to come. In fact, it seems so valuable that under different circumstances, instituting an automatic linting step might have been a project all on its own. This indicated to us that it was a meaningful, significant intermediate step.
In a perfect world, we wouldn’t have to account for shifts in business priorities, incidents, or reorganizations. Unfortunately, these are all a reality of working, regardless of the industry. This is why the best plans account for the unexpected. One way of accounting for disruptive changes is by dividing our project into distinct pieces that can stand alone in the unlikely event that we need to pause development.
With our Python example, we could comfortably pause the project after fixing all errors and warnings the linter highlighted, but before beginning to run a subset of scripts by using the new version. Depending on how we tackled the refactor, pausing halfway through could risk confusing the researchers actively working in the repository. If the refactor needed to be paused for whatever reason, pausing immediately before we started running a subset of scripts using Python 2.7 would be safe; we would still have made considerable progress toward our overall goal and have a clean, easy place to pick things back up when we were next able to.
After taking the time to highlight strategic milestones, we reorganized our execution plan to highlight these steps and grouped subtasks accordingly. The more refined plan is as follows:
- Create a single requirements.txt file.
  - Enumerate all packages used across each of the repositories.
  - Audit all packages and narrow down the list to only required packages with corresponding versions.
  - Identify which version each package should be upgraded to in Python 2.7.
- Merge all the repositories into a single repository.
  - Create a new repository.
  - For each repository, add to the new repository using git submodules.
- Build a Docker image with all the required packages.
  - Test the Docker image on each of the environments.
- Enable linting through continuous integration for the mono repository (monorepo).
  - Choose a linter and corresponding configuration.
  - Integrate the linter into continuous integration.
  - Use the linter to identify logical problems in the code (undefined variables, syntax errors, etc.).
- Install and roll out Python 2.7.1 in all environments.
  - Locate tests for each repository; determine which tests are reliable.
  - Use Python 2.7 on a subset of low-risk scripts.
  - Roll out Python 2.7 to all scripts.
Hopefully, after you’ve identified key milestones, you have a procedure that feels balanced, achievable, and rewarding. It’s important to note, however, that this isn’t a perfect science. It can be quite difficult to weigh required steps against one another according to the effort they involve and their relative impact. We’ll see an example of how we decided to weigh each of these considerations when strategically planning a large-scale refactor in both of our case study chapters, Chapters 10 and 11.
Finally, once we have our end state and our key milestones, we want to interpolate our way through the intermediate steps between our end state and each of our strategic intermediary milestones. This way, we maintain focus on the most critical pieces, all while building out a detailed plan.
This is where we can spend some time figuring out whether certain portions of the refactor are order-agnostic; that is, whether they can be completed at any point, with very few or no prerequisites. For example, let’s say you’ve identified a few key milestones for your project; we’ll call them A, B, C, and D. You notice that you need to complete A before tackling B or C, and B needs to be completed before you tackle D. You have three options concerning C: you could parallelize development on C at the same time as D, complete C and then D, or complete D followed by C.
If you have a hunch that B is going to be a difficult, lengthy milestone and D looks just as challenging, you might want to break things up by putting milestone C between B and D. This should help boost morale and add some pep to the team’s momentum as you work through a long refactor. On the other hand, if you think that you can comfortably parallelize work on milestones C and D, and wrap up the project a little bit sooner, then that might be a worthwhile option as well.
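When the prerequisites between milestones get harder to hold in your head, it can help to enumerate the orderings they allow. The sketch below encodes the A, B, C, and D example; the `prerequisites` map and the `validOrderings` helper are illustrative names, and the function only lists strictly sequential orderings, so it says nothing about working on two milestones in parallel.

```javascript
// Prerequisites from the example: A comes before B and C; B comes before D.
const prerequisites = { A: [], B: ["A"], C: ["A"], D: ["B"] };

// Enumerate every sequential ordering that respects the prerequisites.
function validOrderings(prereqs, done = []) {
  const remaining = Object.keys(prereqs).filter((m) => !done.includes(m));
  if (remaining.length === 0) return [done.join(" -> ")];
  return remaining
    .filter((m) => prereqs[m].every((p) => done.includes(p))) // milestones ready to start
    .flatMap((m) => validOrderings(prereqs, [...done, m]));
}

console.log(validOrderings(prerequisites));
// [ 'A -> B -> C -> D', 'A -> B -> D -> C', 'A -> C -> B -> D' ]
```

Beyond the orderings that place C relative to D, the enumeration shows that the stated prerequisites would also allow C to come before B. Which ordering to pick remains a judgment call about effort and morale.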
It all comes down to balancing the time and effort associated with each requisite step, all the while considering their impact on your codebase and the well-being of your team.
Having a thoughtful rollout strategy for your refactoring effort can make the difference between great success and utter failure. Therefore, it is absolutely critical to include it as part of your execution plan. If your refactor involves multiple distinct phases, each with its own rollout strategy, be certain to outline each of these among the concluding steps of each phase. Although teams of all kinds use a great variety of deployment practices, in this section, we’ll only discuss rollout strategies specific to teams that perform continuous deployment.
Typically, product engineering teams that employ continuous deployment will begin development on a new feature, testing it both manually and in an automated fashion throughout the process. When all the boxes have been checked, the feature is carefully, incrementally rolled out to live users. Before the final rollout phase, many teams will deploy the feature to an internal build of their product, giving themselves yet another opportunity to weed out problems before kicking off deployment to users. Measuring success in this case is easy; if the feature works as expected, great! If we find any bugs, we devise a fix, and depending on the implications of that fix, either repeat the incremental rollout process or push it out to all users immediately.
It’s common practice in continuous deployment environments to use feature flags to hide, enable, or disable specific features or code paths continually at runtime. Good feature flag solutions allow development teams the flexibility to assign groups of users to specific features (sometimes according to a number of different attributes). If you work on a social media application, for instance, you might want to release a feature to all users within a single geographic area, a random 1 percent of users globally, or all users who are over the age of 40.
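To make the idea of assigning groups of users to a feature concrete, here is a minimal sketch of a flag check. Everything in it is hypothetical: the flag shape, the `isEnabled` and `bucketFor` names, and the choice of a simple FNV-1a-style hash are assumptions for illustration, not the API of any particular feature flag product. Note that the percentage rollout here is deterministic per user rather than truly random, so a given user keeps seeing the same behavior from one request to the next.

```javascript
// Hypothetical flag definition; this shape is an assumption for illustration.
const newProfileFlag = {
  name: "new-profile-page",
  regions: ["CA"], // every user in these regions gets the feature
  minAge: null,    // if set, every user older than this age gets the feature
  percent: 1,      // plus this percentage of all remaining users
};

// Map a string to a stable bucket in [0, 100) with a small FNV-1a-style hash.
function bucketFor(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

// Decide whether a flag is on for a user with { id, region, age }.
function isEnabled(flag, user) {
  if (flag.regions && flag.regions.includes(user.region)) return true;
  if (flag.minAge != null && user.age > flag.minAge) return true;
  // Hashing flag name and user ID together keeps buckets independent per flag.
  return bucketFor(`${flag.name}:${user.id}`) < flag.percent;
}
```

Raising `percent` step by step widens the audience without reshuffling users who already have the feature, because each user's bucket never changes.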
With refactoring projects, while we most certainly want to test our changes early and frequently, and very carefully roll them out to users, it’s quite a bit trickier to determine whether everything is working as intended. After all, one of the key success metrics is that no behavior has changed, and it is much more difficult to ascertain a lack of change than to discover even the smallest change. One of the easiest ways to confirm that the refactor hasn’t introduced any new bugs is to programmatically compare pre-refactor behavior with post-refactor behavior.
We can compare pre-refactor and post-refactor behavior by employing what we at Slack call the light/dark technique. Here’s how it works.
First, implement the refactored logic separately from the current logic. Example 4-1 depicts this step on a small scale.
    // Linear search; this is the old implementation
    function search(name, alphabeticalNames) {
      for (let i = 0; i < alphabeticalNames.length; i++) {
        if (alphabeticalNames[i] == name) return i;
      }
      return -1;
    }

    // Binary search; this is the new implementation
    function searchFaster(name, alphabeticalNames) {
      let startIndex = 0;
      let endIndex = alphabeticalNames.length - 1;

      while (startIndex <= endIndex) {
        let middleIndex = Math.floor((startIndex + endIndex) / 2);
        if (alphabeticalNames[middleIndex] == name) return middleIndex;
        if (alphabeticalNames[middleIndex] > name) {
          endIndex = middleIndex - 1;
        } else if (alphabeticalNames[middleIndex] < name) {
          startIndex = middleIndex + 1;
        }
      }
      return -1;
    }
Then, as shown in Example 4-2, relocate the logic from the current implementation to a separate function.
    // Existing function now calls into relocated implementation
    function search(name, alphabeticalNames) {
      return searchOld(name, alphabeticalNames);
    }

    // Linear search logic moved to a new function.
    function searchOld(name, alphabeticalNames) {
      for (let i = 0; i < alphabeticalNames.length; i++) {
        if (alphabeticalNames[i] == name) return i;
      }
      return -1;
    }

    // Binary search; this is the new implementation
    function searchFaster(name, alphabeticalNames) {
      let startIndex = 0;
      let endIndex = alphabeticalNames.length - 1;

      while (startIndex <= endIndex) {
        let middleIndex = Math.floor((startIndex + endIndex) / 2);
        if (alphabeticalNames[middleIndex] == name) return middleIndex;
        if (alphabeticalNames[middleIndex] > name) {
          endIndex = middleIndex - 1;
        } else if (alphabeticalNames[middleIndex] < name) {
          startIndex = middleIndex + 1;
        }
      }
      return -1;
    }
Then, transform the previous function into an abstraction, conditionally calling either implementation. During dark mode, both implementations are called, the results are compared, and the results from the old implementation are returned. During light mode, both implementations are called, the results are compared, and the results from the new implementation are returned. As can be seen in Example 4-3, repurposing the existing function definition allows us to modify as little code as possible. (Though not depicted in our example, to prevent performance degradations as part of the light/dark process, both the old and new implementations should be executed concurrently.)
    // Existing function now an abstraction for calling into either implementation.
    // darkMode and lightMode are assumed to be feature flag values defined elsewhere.
    function search(name, alphabeticalNames) {
      // If we're in dark mode, return the old result.
      if (darkMode) {
        const oldResult = searchOld(name, alphabeticalNames);
        const newResult = searchFaster(name, alphabeticalNames);
        compareAndLog(oldResult, newResult);
        return oldResult;
      }

      // If we're in light mode, return the new result.
      if (lightMode) {
        const oldResult = searchOld(name, alphabeticalNames);
        const newResult = searchFaster(name, alphabeticalNames);
        compareAndLog(oldResult, newResult);
        return newResult;
      }

      // Neither mode is enabled: run only the old implementation.
      // (Calling search() here would recurse forever.)
      return searchOld(name, alphabeticalNames);
    }

    // Linear search logic moved to a new function.
    function searchOld(name, alphabeticalNames) {
      for (let i = 0; i < alphabeticalNames.length; i++) {
        if (alphabeticalNames[i] == name) return i;
      }
      return -1;
    }

    // Binary search; this is the new implementation
    function searchFaster(name, alphabeticalNames) {
      let startIndex = 0;
      let endIndex = alphabeticalNames.length - 1;

      while (startIndex <= endIndex) {
        let middleIndex = Math.floor((startIndex + endIndex) / 2);
        if (alphabeticalNames[middleIndex] == name) return middleIndex;
        if (alphabeticalNames[middleIndex] > name) {
          endIndex = middleIndex - 1;
        } else if (alphabeticalNames[middleIndex] < name) {
          startIndex = middleIndex + 1;
        }
      }
      return -1;
    }

    // Log a message whenever the two implementations disagree.
    function compareAndLog(oldResult, newResult) {
      if (oldResult != newResult) {
        console.log(
          `Diff found; old result: ${oldResult}, new result: ${newResult}`
        );
      }
    }
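Because the example runs the two implementations one after the other, it may help to see roughly what concurrent execution could look like. The sketch below is one possible shape for dark mode, assuming both implementations are asynchronous functions that return promises (for instance, because they make network requests). The name `runDarkConcurrently` and the injectable `log` parameter are hypothetical. In a single-threaded runtime such as Node, this overlaps only the time spent waiting on I/O; CPU-bound work still runs serially.

```javascript
// Run the old and new implementations concurrently in dark mode.
// Assumes oldImpl and newImpl are async functions (they return promises).
async function runDarkConcurrently(oldImpl, newImpl, args, log = console.log) {
  // Both calls are started before either is awaited, so their I/O can overlap.
  const [oldOutcome, newOutcome] = await Promise.allSettled([
    oldImpl(...args),
    newImpl(...args),
  ]);

  // The old path is still the source of truth, so its failures surface as before.
  if (oldOutcome.status === "rejected") throw oldOutcome.reason;

  // A failure in the new path is logged but never reaches the caller.
  if (newOutcome.status === "rejected") {
    log(`New implementation threw: ${newOutcome.reason}`);
  } else if (oldOutcome.value !== newOutcome.value) {
    log(`Diff found; old result: ${oldOutcome.value}, new result: ${newOutcome.value}`);
  }
  return oldOutcome.value;
}
```

Using `Promise.allSettled` rather than `Promise.all` is a deliberate choice in this sketch: an exception thrown by the new code path is recorded as a discrepancy instead of breaking a request that the old code path would have served correctly.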
Once the abstraction has been properly put in place, start enabling dark mode (i.e., dual code path execution, returning the results of the old code). Monitor any differences being logged between the two result sets. Track down and fix any potential bugs in the new implementation causing those discrepancies. Repeat this process until you’ve properly handled all discrepancies, enabling dark mode to broader groups of users.
Once all users have been opted in to dark mode, starting with the lowest-risk environments, begin enabling light mode to small subsets of users (i.e., start returning data from the new code path). Continue logging any differences in the result sets; this can be useful if other developers are actively working on related code and risk introducing a change to the old implementation that is not reflected in the new implementation. Continue to opt broader groups of users into light mode, until everyone is successfully processing results from the new implementation.
Finally, disable dual execution so that only the new code path runs, and continue to monitor for any reported bugs. Remove the abstraction, feature flags, and conditional execution logic. Once the refactor has been live to users for an adequate period (whatever that might be for your use case), remove the old logic altogether. Only the new implementation should remain where the old implementation once was. See Example 4-4 for an example.
    // Binary search; this is the new implementation
    function search(name, alphabeticalNames) {
      let startIndex = 0;
      let endIndex = alphabeticalNames.length - 1;

      while (startIndex <= endIndex) {
        let middleIndex = Math.floor((startIndex + endIndex) / 2);
        if (alphabeticalNames[middleIndex] == name) return middleIndex;
        if (alphabeticalNames[middleIndex] > name) {
          endIndex = middleIndex - 1;
        } else if (alphabeticalNames[middleIndex] < name) {
          startIndex = middleIndex + 1;
        }
      }
      return -1;
    }
As with any approach, there are some downsides to be mindful of. If the code you are refactoring is performance-sensitive, and you’re operating in an environment that does not enable true multi-threading (PHP, Python, or Node), then running two versions of the same logic side by side might not be a great option. Say you’re refactoring code that involves making one or more network requests; assuming those dependencies do not change with the refactor, you’ll be executing double the number of network requests, serially. You must weigh the ability to audit your changes at a high fidelity against a corresponding increase in latency. One trade-off might be to run the dual code paths and subsequent comparison at a sampled rate; if this path is hit very frequently, running a comparison just 5 percent of the time might accumulate ample data about whether your solution is working as expected without compromising too heavily on performance.
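As a rough illustration of the sampling trade-off, the sketch below wraps the two implementations so that the new code path and the comparison run on only a fraction of calls. The `makeSampledDarkRunner` name and its options are hypothetical; the `random` and `log` parameters are injectable only to make the behavior easy to test.

```javascript
// Build a dark-mode runner that executes the new path on a sampled subset of calls.
// sampleRate is a fraction between 0 and 1 (0.05 means roughly 5 percent of calls).
function makeSampledDarkRunner(
  oldImpl,
  newImpl,
  sampleRate,
  { random = Math.random, log = console.log } = {}
) {
  return function run(...args) {
    // The old implementation always runs and always supplies the result.
    const oldResult = oldImpl(...args);

    if (random() < sampleRate) {
      try {
        const newResult = newImpl(...args);
        if (oldResult !== newResult) {
          log(`Diff found; old result: ${oldResult}, new result: ${newResult}`);
        }
      } catch (err) {
        // A failure in the new path is recorded but never reaches the caller.
        log(`New implementation threw: ${err}`);
      }
    }
    return oldResult;
  };
}
```

If the sample rate is read from configuration rather than hardcoded, it can be raised gradually without a code change as discrepancies are resolved.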
We also have to be mindful of any additional load we’ll be placing on downstream resources. This can include anything from a database, to a message queue, to the very systems we are using to log differences across the code paths we’re comparing. If we are refactoring a high-traffic path, and we want to run the comparison often, we need to be certain that we won’t accidentally overburden our underlying infrastructure. In my experience, comparisons can unearth a swarm of unexpected differences (particularly when refactoring old, complex code). It’s safer to take a slow, incremental approach to ramping up dual execution and comparison than to risk overloading your logging system. Set a small initial sample rate, address any high-frequency differences as they crop up, and repeat, increasing the sample rate step by step until you reach either 100 percent or a stable state at which you are confident no more discrepancies should arise.
With the refactor at Smart DNA, the greater risk was in migrating each of the repositories’ many dependencies to versions compatible with Python 2.7, not with running the existing code itself, using the newer Python version. The software team decided that they would first perform a few preliminary tests, setting up a subset of the data pipeline in an isolated environment, installing both versions of Python, and running a few jobs, using the new dependency file in the 2.7 environment. When they were confident with the results of their preliminary tests, they would slowly, carefully introduce usage of the new set of dependencies in production.
To limit the risk involved, the team audited the jobs that make up the researchers’ data pipeline and grouped them according to their importance. Then the engineers chose a low-risk job with the fewest downstream dependencies to migrate first. They worked with the research team to identify a good time to swap the configuration to point to the new requirements.txt file and new Python version. Once the change had been made, the team planned to monitor logs generated by the job to catch any strange behavior early. If any problems crept up, the configuration would be swapped back to its original version while the software team worked on a fix. When the fix was ready, the team would repeat the experiment. As part of their rollout plan, the team required the configuration change to sit in production for a few days, allowing for the job to run successfully on a dozen occasions before moving on to a second job.
After the second job was successfully migrated, the software team would opt all low-risk jobs in to the new configuration. They would then repeat the process for the medium-risk jobs. Finally, for the most critical jobs, the team decided to migrate each of these individually, due to their importance. Again, they would wait a few days before repeating the process for the next job, and so on. In all, the team determined it would take nearly two months to migrate the entire data pipeline to the new environment. While this might sound like a grueling process, both the software and research teams agreed that it was necessary to reduce the risk sufficiently. It gave everyone adequate opportunity to weed out problems by small increments early, ensuring that the pipeline remained as healthy as possible throughout the entire process.
In Chapter 1, I mentioned that you shouldn’t embark on a refactor unless you have the time to execute to completion. No refactor is complete unless all remaining transitional artifacts are properly cleaned up. Following is a short, not-exhaustive list of the kinds of artifacts we generate during the refactoring process.
Most of us are guilty of leaving one or two feature flags behind. It’s not so bad to forget to remove a flag for a few days (or even a few weeks), but a tangible risk is associated with failing to clean these up. First, verifying whether a feature flag is enabled adds complexity. Engineers reading code gated by a feature flag need to consider the behavior if the flag is enabled or disabled. This is necessary overhead for feature development in a continuous deployment environment, but we should prioritize removing it soon after we are able to do so. Second, stale feature flags can pile up. A single flag won’t weigh down your application, but hundreds of stale flags certainly might. Practice good feature flag etiquette; add authors and expiration dates, and follow up with those engineers once those dates have passed.
We can attempt to shield our refactor from other developers by building abstractions to hide the transition. In fact, we might have written one to use the deployment method outlined in “Dark Mode/Light Mode”. Once we’ve finished refactoring, however, these abstractions are generally no longer meaningful and can further confuse developers. When our abstractions still contain some meaningful logic, we should strive to simplify them so that engineers reading them in the future have no reason to suspect that they were written for the purpose of smoothly refactoring something.
When we’re refactoring something, particularly when we’re refactoring something at large scale, we typically end up with a sizable amount of dead code following rollout. Although dead code isn’t dangerous on its own, it can be frustrating for engineers down the line trying to determine whether it is still being used. Recall “Unused Code”, where we discussed the downsides of keeping unused code in the codebase.
We leave a variety of comments when executing on a refactor. We warn other developers of code in flux, maybe leave a handful of TODOs, or make note of dead code to be removed once the refactor is finished. These comments should be deleted so as not to mislead anyone. On the off chance that we come across any stray, unfinished TODOs, we’ll be even more gratified that we took the time to tidy up our work.
Depending on how we’re executing the refactor, we may have written duplicative unit tests alongside existing ones to verify the correctness of our changes. We need to clean up any newly superfluous tests so that we don’t confuse any developers referencing them later. (Redundant unit tests also aren’t great if your team wants to maintain a speedy unit testing suite.)
A few years ago, a teammate of mine ran an experiment to determine how much time we were spending calculating feature flags. For the average request to our backend systems, it amounted to nearly 5 percent of execution time. Unfortunately, a great deal of the feature flags we were spending time calculating had already been enabled to all production workspaces and could have been removed entirely. We built some tooling to urge developers to clean up their expired flags and within just a few weeks had dramatically reduced the time spent processing them. Feature flags really do add up!
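Tooling of this kind does not need to be elaborate. The sketch below is not the tooling described above; it is a minimal, hypothetical illustration of the etiquette of attaching an owner and an expiration date to every flag, then listing the expired flags per owner so that someone can follow up. The registry shape and field names are assumptions.

```javascript
// Hypothetical flag registry. The field names (owner, expires) are assumptions
// for illustration and do not come from any particular feature flag product.
const flagRegistry = [
  { name: "search-dark-mode", owner: "avery", expires: "2021-03-01" },
  { name: "search-light-mode", owner: "avery", expires: "2021-09-01" },
  { name: "new-onboarding", owner: "sam", expires: "2020-11-15" },
];

// Return the names of flags whose expiration date has passed, grouped by owner.
function expiredFlagsByOwner(registry, today = new Date()) {
  const byOwner = {};
  for (const flag of registry) {
    if (new Date(flag.expires) < today) {
      if (!byOwner[flag.owner]) byOwner[flag.owner] = [];
      byOwner[flag.owner].push(flag.name);
    }
  }
  return byOwner;
}
```

A report like this could be run on a schedule and sent to each owner, which keeps the follow-up from depending on anyone's memory.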
If there’s a common thread for why we should clean up each of the kinds of transitional artifacts we produce, it’s to minimize developer confusion and frustration. Artifacts add additional complexity, and engineers encountering them risk wasting a considerable amount of time understanding their purpose. We can save everyone ample frustration by cleaning them up!
As you execute on your refactoring effort, choose a tag that your team can use to label any artifacts you’ll need to clean up. It can be something as simple as leaving an inline comment like TODO: project-name, clean up post release. Whatever it is, make it easy to search for so that once you’re in the final stages of the project, you can quickly locate all the places that could use a final polish.
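To show why a consistent, searchable tag pays off, here is a small sketch that lists every tagged line in a piece of source text. The `findCleanupTags` function is hypothetical; in practice, a recursive text search across the repository with your editor or a command-line tool accomplishes the same thing.

```javascript
// Return the 1-based line number and trimmed text of every line carrying the tag.
function findCleanupTags(source, tag) {
  return source
    .split("\n")
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter(({ text }) => text.includes(tag));
}

// Example: two artifacts tagged for cleanup in a short snippet.
const snippet = [
  "const results = search(name, names);",
  "// TODO: project-name, clean up post release",
  "compareAndLog(oldResult, newResult);",
  "  // TODO: project-name, clean up post release (remove flag check)",
].join("\n");

const cleanupSites = findCleanupTags(snippet, "TODO: project-name");
```

The tag only works if everyone on the team writes it the same way, so it is worth agreeing on the exact string before the refactor begins.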
In Chapter 3, we discussed a wide variety of ways we could characterize the state of the world before we began forming a plan of action. We talked about how these metrics should make a compelling case in support of your project to your teammates and management alike. At the start of this chapter, we also described the importance of using these metrics to define an end state (see “Defining Your End State”). Now, we need to complement the intermediate steps we identified earlier (see “Identifying Strategic Intermediate Milestones”), with their own metrics. These will be useful for you and your team to determine whether you’re making the progress you expected to see, and course-correct early if your trajectory appears off.
Execution plans are also one of the first glimpses management (whether that’s your team’s product manager, your skip-level, or your Chief Technology Officer [CTO]) will have of a project. For them to support the initiative, not only does your problem statement need to be convincing with clear success criteria, your proposal also needs to include definitive progress metrics. Showing that you have a strong direction should ease any concerns they might have about giving the go-ahead on a lengthy refactor.
Recall Table 4-1, where we showed our starting metrics alongside our final goal metrics. If those start and end metrics lend themselves well to intermediate measurements during the refactor, we can add an entry for each milestone highlighting which metrics we expect to change and by how much.
End-goal metrics that might lend themselves better to intermediate measurements include complexity metrics, timings data, test coverage measurements, and lines of code. Be warned, however, that your measurements might trend worse before they trend better again! Consider the approach detailed in “Dark Mode/Light Mode”, for instance; having two code paths, both of which do the same thing, will definitely lead to a tangible uptick in complexity and lines of code.
Unfortunately, with our Python migration example, the language version remains the same throughout most of the project. Only once the team has reached the stage of rolling out the new version to each of the company’s environments can we start to see our metrics change. To measure progress, we will need to come up with a different set of metrics to track throughout development.
As the previous section showed, not all end-goal metrics will lend themselves well to showing intermediate progress. If that happens to be the case, we’ll still need at least one helpful metric to indicate momentum. The metrics we choose might not directly correlate to our final goal, but they’re important guideposts along the way.
There are a number of simple options. Say that at Smart DNA we’ve set up continuous integration and enabled the linter to warn of undefined variables. We can use the number of warnings remaining as a metric to measure the team’s progress within the scope of that step. Table 4-2 shows each of the major milestones we brainstormed in “Identifying Strategic Intermediate Milestones” with its corresponding metric. (Note that the starting value for the linting milestone is an approximation. The team provided an estimate here by running pylint with the default configuration across the three repositories and summing up the number of warnings generated.)
| Milestone description | Metric description | Start | Goal | Observed |
|---|---|---|---|---|
| Create a single requirements.txt file | Number of distinct lists of dependencies | 3 | 1 | - |
| Merge all the repositories into a single repository | Number of distinct repositories | 3 | 1 | - |
| Build a Docker image with all the required packages | Number of environments using new Docker image | 0 | 5 | - |
| Enable linting through continuous integration for the monorepo | Number of linter warnings | approx. 15,000 | 0 | - |
| Install and roll out Python 2.7.1 on all environments | Number of jobs running on Python 2.7.1 with new requirements.txt file | 0 | 158 | - |
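Once observed values start to come in, each row of a table like this one can be turned into a simple progress fraction. The sketch below is one way to do that; the `progressToward` helper is hypothetical, and the observed values in the example are made up for illustration, since the table does not record any yet.

```javascript
// Fraction of the way from the starting value to the goal, clamped to [0, 1].
// Works whether the metric is meant to go down (warnings) or up (migrated jobs).
function progressToward(start, goal, observed) {
  if (start === goal) return 1;
  const fraction = (observed - start) / (goal - start);
  return Math.min(1, Math.max(0, fraction));
}

// Made-up observations: 9,000 linter warnings remaining, 79 of 158 jobs migrated.
const lintingProgress = progressToward(15000, 0, 9000);
const jobsProgress = progressToward(0, 158, 79);
```

Clamping keeps the number readable when a metric temporarily trends worse than its starting value, although that regression is itself worth noting.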
After taking the time to associate metrics with our most important milestones, I recommend starting to make estimates. Our plan isn’t in its final stages quite yet, so our estimates should not be terribly specific (e.g., on the order of weeks or months rather than days) but, most importantly, should be generous.
Going back to our cross-Canada roadtrip, we’ve set some general guidelines for when and where we want to stop for food and a good night’s sleep along our trip from Montreal to Vancouver. The longest drive we plan to do is the stretch between Regina, SK, and Calgary, AB; just under 800 km of highway for roughly a 7.5-hr drive. By making sure that we’re never driving more than eight hours per day, we’re giving ourselves plenty of time to pack up in the morning from our starting point and decide how to distribute our day. What’s important is that we’ve given ourselves enough time to enjoy the journey; we still intend to make some serious strides every day, but not so serious that we’ll be burnt out by the time we reach Vancouver.
Most teams have their own guidelines and processes around deriving estimates, but if you don’t have one already (or don’t quite know how to go about estimating a particularly large software project), here’s a simple technique. Go through each of the milestones and assign a number from 1 to 10, where 1 denotes a relatively short task and 10 denotes a lengthy task. Estimate how long your lengthiest milestone might take. Now imagine what is most likely to go wrong during that milestone and update your estimate to account for it. (Don’t overdo it! It’s important to be reasonable with the amount of buffer we add to our estimates; otherwise, leadership might ultimately decide our refactor is not a worthwhile endeavor.) Now, measure each shorter milestone against this lengthier one. If you anticipate that your longest milestone will take 10 weeks to complete, and your second-longest milestone should take almost as much time, then maybe nine weeks is a good estimate. Keep going down the list until you’ve given everything a rough estimate.
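The arithmetic behind this technique can be sketched in a few lines. The version below assumes a simple linear scaling of each milestone's score against the longest milestone, rounded up to whole weeks. That scaling rule, the `estimateWeeks` name, and the sample scores are assumptions for illustration; in practice, each comparison is a judgment call rather than a formula.

```javascript
// Scale each milestone's 1-to-10 score against the longest milestone's estimate.
// longestWeeks should already include the buffer for what is most likely to go wrong.
function estimateWeeks(milestones, longestWeeks) {
  const maxScore = Math.max(...milestones.map((m) => m.score));
  return milestones.map((m) => ({
    name: m.name,
    // Round up to whole weeks, and never estimate less than one week.
    weeks: Math.max(1, Math.ceil((m.score * longestWeeks) / maxScore)),
  }));
}

// Hypothetical scores: the longest milestone is buffered to 10 weeks.
const roughEstimates = estimateWeeks(
  [
    { name: "Milestone A", score: 10 },
    { name: "Milestone B", score: 9 },
    { name: "Milestone C", score: 3 },
  ],
  10
);
```

With these inputs, the second-longest milestone comes out at nine weeks, matching the example in the text. Treat the output as a starting point for discussion with your team, not a commitment.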
From a refactoring perspective, setting generous estimates is important for two main reasons. First, it gives your team wiggle room for when you run into the inevitable roadblock or two. The larger the software project, the greater the chance something won’t go quite to plan, and refactoring is no exception to that rule. Building a reasonable buffer into your estimates will give your team a chance to hit important deadlines while accounting for a few pesky bugs and incidents along the way.
Large-scale refactoring efforts tend to affect multiple teams, so there’s a reasonable chance that your project might end up unexpectedly butting heads with another team’s project. Setting generous estimates allows you to navigate those situations more smoothly; you’ll be more level-headed going into negotiations with the other team, knowing you have sufficient time to hit your next milestone. You’re more likely to come up with creative solutions to the impasse. If your team needs to pause work on the current milestone, maybe you can pivot quickly, shifting your focus to a different portion of the refactor, and come back to the current work later.
Second, these estimates will help you set expectations with stakeholders (product managers, directors, CTOs) and teams that risk being affected by your refactor. We’ll ask them for their perspective on our plan next, and if we’re careful to build ample buffers into the estimates we provide, we’ll have some room to negotiate. The next section deals more closely with how to best navigate these conversations.
请记住,您可以对整个项目给出的估计值大于其各个部分的总和。除非您的组织对如何估算软件项目非常严格,否则没有规则规定预期的项目完成日期应与其各个部分的完成日期完全一致。
Remember that you can give the overall project a greater estimate than the sum of each of its parts. Unless your organization is stringent about how to estimate software projects, no rule states that the anticipated project completion date should precisely line up with the completion of its individual components.
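To make that arithmetic concrete, here is a minimal sketch that totals per-milestone ranges and pads the pessimistic end. The milestone ranges mirror the Smart DNA plan presented later in this chapter; the 20% buffer, the four-weeks-per-month conversion, and the assumption that milestones run strictly one after another are illustrative choices, not recommendations from the plan itself.

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    name: str
    low_weeks: float   # optimistic estimate
    high_weeks: float  # pessimistic estimate


def project_estimate(milestones, buffer_ratio=0.2):
    """Return (low, high) totals in weeks, padding the high end by buffer_ratio."""
    low = sum(m.low_weeks for m in milestones)
    high = sum(m.high_weeks for m in milestones)
    return low, high * (1 + buffer_ratio)


# Ranges assume milestones run sequentially and that one month is four weeks.
milestones = [
    Milestone("Single requirements.txt", 2, 3),
    Milestone("Merge repositories", 2, 3),
    Milestone("Docker image", 1, 2),
    Milestone("Linting in CI", 4, 6),         # 1-1.5 months
    Milestone("Roll out Python 2.7", 8, 10),  # 2-2.5 months
]

low, high = project_estimate(milestones)
print(f"Communicated estimate: {low:g} to {high:g} weeks")
```

The point of keeping the buffer as a separate parameter is that you can discuss it openly with stakeholders rather than hiding it inside each milestone.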
大型重构项目通常会影响所有学科的大量工程组。您可以通过逐步执行计划并确定您认为在每个阶段受重构影响最大的团队来确定有多少(以及哪些)工程组。与您的团队(或一小群值得信赖的同事)集思广益,以确保您已涵盖各种学科和部门。如果您的公司规模足够小,请考虑查看所有工程部门的列表,并针对每个组确定他们是否愿意为您的计划提供意见。许多公司都会组建技术设计委员会,您可以向该委员会提交项目提案,以供来自公司不同学科的工程师进行评审。如果可以,请利用这些委员会;您很可能会在启动会议之前了解到大量有用的信息。
Large refactoring projects typically affect a large number of engineering groups of all disciplines. You can determine just how many (and which ones) by stepping through your execution plan and identifying any teams you think might be most closely affected by your refactor at each stage. Brainstorm with your team (or a small group of trusted colleagues) to make sure you’ve covered a variety of disciplines and departments. If your company is small enough, consider going through a list of all engineering departments and for each group decide whether they might appreciate the opportunity to provide input on your plan. Many companies put together technical design committees, to which you can submit a project proposal to be critiqued by engineers of different disciplines from across the company. Take advantage of these committees if you can; you’re likely to learn a great deal of useful information well before your kick-off meeting.
与其他团队分享执行计划有两个主要原因。第一个原因,也许是最重要的原因,是为了提供透明度。第二个原因是在寻求管理层认可之前,收集有关计划的观点以进一步加强计划。
There are two primary reasons for sharing your execution plan with other teams. The first, and perhaps most important reason, is to provide transparency. The second is to gather perspective on your plan to strengthen it further before seeking buy-in from management.
透明度有助于在团队之间建立信任。如果你对公司的其他工程师坦诚相待,他们就更有可能参与并投入你的努力。这应该是不言而喻的,但如果你的团队起草了一份计划并开始执行一项影响多个团队的重构,而没有发出警告,你就有可能严重破坏这种关系。
Transparency helps build trust across teams. If you’re upfront with other engineers at the company, they’re more likely to be engaged and invested in your effort. It should go without saying, but if your team drafts a plan and starts executing on a refactor that affects a number of groups without warning them, you risk dangerously eroding that trust.
您必须注意,您提议的更改可能会彻底改变他们拥有的代码或影响他们维护的重要流程。通过 Smart DNA 的 Python 迁移,我们将三个存储库合并为一个。对于在这些存储库中工作的任何开发人员或研究人员来说,这都是一个重大变化。受影响的团队应该得到充分的预先警告,他们的开发流程将会改变。
You must be mindful of the fact that your proposed changes could drastically change code that they own or affect important processes they maintain. With Smart DNA’s Python migration, we’re combining three repositories into one. This is a significant change for any developer or researcher working in any of these repositories. The affected teams should be adequately forewarned that their development process is going to change.
重构还可能影响其他团队的生产力。例如,如果我们提议将所有必需的包合并到一个全局requirements.txt文件中,我们可能需要其他团队的帮助来审查和批准他们的更改。我们甚至可能会询问从其他团队借用工程师来帮助重构(有关如何招募队友的更深入的信息,请参阅第 6 章)。
The refactor also risks affecting other teams’ productivity. For instance, if we’re proposing to combine all required packages into a single, global requirements.txt file, we may need other teams’ help getting their changes reviewed and approved. We might even inquire about borrowing engineers from other teams to help out with the refactor (see Chapter 6 for a more in-depth look at how to recruit teammates).
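One way to see where other teams’ input is needed is to merge the per-repository lists and flag every package the repositories disagree on. The sketch below is an assumption-laden illustration: it only understands `name==version` pins, it is written in modern Python rather than the 2.x versions discussed here, and the repository names and pinned versions are hypothetical.

```python
from collections import defaultdict


def parse_requirements(text):
    """Parse 'name==version' lines, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip() or None
    return pins


def merge_requirements(repos):
    """Combine per-repository pins; return (agreed, conflicts)."""
    seen = defaultdict(dict)  # package -> {repo: version}
    for repo, text in repos.items():
        for name, version in parse_requirements(text).items():
            seen[name][repo] = version
    agreed, conflicts = {}, {}
    for name, by_repo in seen.items():
        versions = set(by_repo.values())
        if len(versions) == 1:
            agreed[name] = versions.pop()
        else:
            conflicts[name] = by_repo  # needs a decision from the owning teams
    return agreed, conflicts


# Hypothetical repository contents, for illustration only.
repos = {
    "pipeline": "numpy==1.6.2\nrequests==1.2.3\n",
    "research": "numpy==1.7.1  # updated by hand\n",
    "web": "requests==1.2.3\n",
}
agreed, conflicts = merge_requirements(repos)
print("agreed:", agreed)
print("needs review:", conflicts)
```

Each entry in `conflicts` names the repositories involved, which tells you which teams to ask for a review.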
同样,您必须确保您的计划与受影响的团队保持一致。如果您计划修改另一个团队拥有的代码,而他们正计划开始开发一个主要功能(或者可能是他们自己的大规模重构),则需要进行协调以确保不会互相干扰。
Similarly, you have to make sure that your plans align with those of affected teams. If you’re planning to modify code owned by another team just as they are planning to kick off development on a major feature (or perhaps their own hefty refactor), you will need to coordinate to make sure you aren’t stepping on each other’s toes.
与其他团队分享您的计划的第二个原因是了解他们的观点。您已经进行了研究来定义问题并起草了全面的计划,但是那些可能受到您提议的变更影响的团队是否支持您的努力?如果他们不相信您的重构的好处大于给他们的团队带来的风险和不便,您可能需要重新考虑您的方法。也许您可以以更有说服力的方式传达这些好处,或者找到一种方法来降低与当前计划相关的风险级别。与团队合作,找出什么会让他们对您的计划更满意。(您可以使用下一章中概述的一些技术来提供帮助。)
The second reason to share your plan with other teams is to get their perspective. You’ve done the research to define the problem and draft a comprehensive plan, but are the teams that risk being affected by your proposed changes supportive of your effort? If they do not believe that the benefits of your refactor outweigh the risks and inconvenience to their team, you may need to reconsider your approach. Perhaps you could convey the benefits in a more convincing manner, or find a way to reduce the level of risk associated with the current plan. Work with the team to figure out what would make them more comfortable with your plan. (You can use some of the techniques outlined in the next chapter to help out.)
如果您正在重构一个复杂的产品,那么可能有许多您没有考虑到的极端情况。仅仅获得第二双(以及第三双和第四双)眼睛的帮助就可以产生巨大的不同。假设在审核 Smart DNA 研究团队使用的软件包时,我们没有注意到一些研究人员直接在其中一台机器上手动更新requirements.txt文件,而不是在版本历史记录中进行更改并部署新代码。当我们与研究人员分享我们的计划时,他们会指出,他们通常在机器本身上更新依赖项,软件团队应该在那里验证版本,而不是检查其存储库中的版本。如果我们在没有先咨询研究人员的情况下开始执行该项目,这种洞察力将为我们的软件团队节省大量痛苦和尴尬。
If you’re working to refactor a complex product, there are likely a number of edge cases you haven’t considered. Just getting that second (and third and fourth) set of eyes can make a huge difference. Let’s say that while auditing the packages used by the research team at Smart DNA, we fail to notice that some researchers have been manually updating a requirements.txt file on one of the machines directly, rather than making their changes in version control and deploying the new code. When we share our plan with the researchers, they’ll point out that they typically update their dependencies on the machine itself and that the software team should verify the version there rather than the one checked into their repository. That insight spares our software team the pain and embarrassment we would have faced had we started executing on the project without consulting the researchers first.
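Once you know that kind of drift can exist, checking for it is cheap. The following sketch compares the checked-in requirements.txt with the copy found on a machine, using the standard library’s `difflib`. How you retrieve the machine’s copy is left out, and the file labels and package versions are hypothetical.

```python
import difflib


def requirements_drift(repo_text, machine_text):
    """Return unified-diff lines between the checked-in file and the deployed copy."""
    repo_lines = sorted(
        line.strip() for line in repo_text.splitlines() if line.strip()
    )
    machine_lines = sorted(
        line.strip() for line in machine_text.splitlines() if line.strip()
    )
    return list(
        difflib.unified_diff(
            repo_lines,
            machine_lines,
            fromfile="repository/requirements.txt",  # illustrative labels
            tofile="machine/requirements.txt",
            lineterm="",
        )
    )


# Hypothetical contents: the machine copy was edited by hand.
repo_text = "numpy==1.6.2\nscipy==0.11.0\n"
machine_text = "numpy==1.7.1\nscipy==0.11.0\n"

drift = requirements_drift(repo_text, machine_text)
for line in drift:
    print(line)
```

An empty result means the two files agree; any `-` or `+` line is a version you should confirm with the people who maintain that machine before trusting the repository.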
请记住,虽然在开始执行之前征求利益相关者对计划的意见很重要,但在这个阶段没有什么是一成不变的。您的计划可能会在重构过程中发生变化;您会遇到一两个意想不到的极端情况,也许会花费比预期更多的时间来解决一个棘手的错误,或者意识到您最初的方法的一部分根本行不通。在这个阶段,我们寻求其他观点主要是为了确保与他人的透明度并尽早消除明显的问题。我们将在第 7 章中讨论如何在计划发展过程中让这些利益相关者参与并了解情况。
Remember that while it’s important to get stakeholders’ opinions about your plan before kicking off execution, nothing is set in stone at this stage. Your plan will likely change throughout the duration of the refactor; you’ll run into an unexpected edge case or two, maybe spend more time than anticipated solving a pesky bug, or realize part of your initial approach simply won’t work. At this stage, we are seeking out other perspectives mostly as a means of ensuring transparency with others and weeding out the blatantly obvious problems early. We’ll discuss how to keep these stakeholders engaged and informed as our plan evolves in Chapter 7.
在 Smart DNA,软件团队努力制定了从 Python 2.6 迁移到 2.7 的全面执行计划。在逐步完成我们概述的每个步骤、定义目标状态、确定重要里程碑、选择推广策略等之后,团队制定了一个有信心的计划,如下所示:
At Smart DNA, the software team worked diligently to build a comprehensive execution plan for its migration from Python 2.6 to 2.7. After working through each of the steps we’ve outlined (defining a goal state, identifying important milestones, choosing a rollout strategy, and so on), the team had a plan it was confident about, as follows:
创建单个requirements.txt文件。
指标:依赖项的不同列表数量;开始: 3;目标: 1
预计: 2-3 周
子任务:
枚举每个存储库中使用的所有包。
审核所有软件包并将列表缩小到仅包含相应版本的所需软件包。
确定在 Python 2.7 中每个包应该升级到哪个版本。
Create a single requirements.txt file.
Metric: Number of distinct lists of dependencies; Start: 3; Goal: 1
Estimate: 2–3 weeks
Subtasks:
Enumerate all packages used across each of the repositories.
Audit all packages and narrow the list to only the required packages with corresponding versions.
Identify which version each package should be upgraded to in Python 2.7.
将所有存储库合并为一个存储库。
指标:不同存储库的数量;开始: 3;目标: 1
预计: 2-3 周
子任务:
创建一个新的存储库。
对于每个存储库,使用 git submodules 添加到新存储库。
Merge all the repositories into a single repository.
Metric: Number of distinct repositories; Start: 3; Goal: 1
Estimate: 2–3 weeks
Subtasks:
Create a new repository.
For each repository, add to the new repository, using git submodules.
使用所有必需的软件包构建 Docker 映像。
指标:使用新 Docker 映像的环境数量;开始: 0;目标: 5
预计: 1-2周
子任务:
在每个环境上测试 Docker 映像。
Build a Docker image with all the required packages.
Metric: Number of environments using new Docker image; Start: 0; Goal: 5
Estimate: 1–2 weeks
Subtasks:
Test the Docker image on each of the environments.
通过对 monorepo 的持续集成启用 linting。
指标: linter 警告数量;起始:约 15,000;目标: 0
预计: 1-1.5个月
子任务:
选择一个 linter 和相应的配置。
将 linter 集成到持续集成中。
使用 linter 识别代码中的逻辑问题(未定义的变量、语法错误等)。
Enable linting through continuous integration for the monorepo.
Metric: Number of linter warnings; Start: approx. 15,000; Goal: 0
Estimate: 1–1.5 months
Subtasks:
Choose a linter and corresponding configuration.
Integrate the linter into continuous integration.
Use the linter to identify logical problems in the code (undefined variables, syntax errors, etc.).
在所有环境中安装并推出 Python 2.7.1。
指标:使用新requirements.txt文件在 Python 2.7.1 上运行的作业数量;开始: 0;目标: 158
预计: 2-2.5个月
子任务:
为每个存储库找到测试;确定哪些测试是可靠的。
在低风险脚本子集上使用 Python 2.7。
将 Python 2.7 推广到所有脚本。
Install and roll out Python 2.7.1 on all environments.
Metric: Number of jobs running on Python 2.7.1 with new requirements.txt file; Start: 0; Goal: 158
Estimate: 2–2.5 months
Subtasks:
Locate tests for each repository; determine which tests are reliable.
Use Python 2.7 on a subset of low-risk scripts.
Roll out Python 2.7 to all scripts.
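Because every milestone in the plan pairs a metric with a start and a goal value, progress can be reported the same way for all of them, whether the metric is meant to rise or fall. Here is a minimal sketch; the start and goal values come from the plan above, while the "current" readings are invented for illustration.

```python
def progress(start, goal, current):
    """Fraction of the way from start to goal, clamped to [0, 1]."""
    if start == goal:
        return 1.0
    fraction = (current - start) / (goal - start)
    return max(0.0, min(1.0, fraction))


# (start, goal, current): start and goal are from the plan; current is made up.
milestones = {
    "distinct dependency lists": (3, 1, 2),
    "distinct repositories": (3, 1, 3),
    "environments on new Docker image": (0, 5, 0),
    "linter warnings": (15000, 0, 9000),
    "jobs on Python 2.7.1": (0, 158, 0),
}

for name, (start, goal, current) in milestones.items():
    print(f"{name}: {progress(start, goal, current):.0%}")
```

Clamping keeps the report sensible if a metric temporarily moves past its starting point, for example when new code adds linter warnings before the cleanup begins.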
如果您使用项目管理软件(如 Trello 或 JIRA)来跟踪团队的项目,我建议为大型里程碑创建一些顶级条目。虽然重构的一些细节可能会在整个开发过程中发生变化,但您在本章中定义的战略里程碑不太可能发生巨大变化。
If you use project management software (like Trello or JIRA) to keep track of your team’s projects, I recommend creating some top-level entries for the large milestones. While some of the nitty-gritty details of the refactor might change throughout development, the strategic milestones you defined in this chapter are less likely to shift dramatically.
对于单个子任务,您应该考虑为您计划执行的前 1 或 2 个里程碑创建条目。您可以找出您的团队在整个开发过程中需要以更规律的节奏处理的较小任务。后面的里程碑更有可能受到早期工作的影响,并且其单个子任务的细节可能会发生变化。仅在您启动后续里程碑的子任务时才为其创建条目。
For the individual subtasks, you should consider creating entries for the first one or two milestones you’re planning to undertake. You can figure out smaller tasks your team needs to tackle at a more regular cadence throughout the development process. Later milestones are more likely to be affected by earlier work, and the specifics of their individual subtasks risk changing. Create entries for the subtasks of subsequent milestones only as you kick them off.
我们已经完成了必要的前期工作,全面了解并描述了大规模重构所涉及的工作,并成功制定了执行计划,我们相信该计划将使我们顺利完成。现在,我们需要获得经理(和其他重要利益相关者)的必要支持,以支持重构,然后我们才能满怀信心地继续前进。
We’ve done the preliminary work required to understand and comprehensively characterize the work involved with our large-scale refactor, and successfully crafted an execution plan we’re confident will lead us to the finish line smoothly. Now, we need to get the necessary buy-in from our manager (and other important stakeholders) to support the refactor before we can confidently forge ahead.
到了高中三年级,我决定我需要一部手机。不仅我几乎每个朋友都有一部手机,而且他们也不再愿意让我每次需要通知父母我的行踪时都用他们的手机给他们打电话。每条短信大约要花 10 美分,每次打电话也要花掉他们宝贵的时间,几个月来,我给近六个朋友花了 10 美分和 25 美分。无论去哪里都带着满口袋的零钱,希望能借别人的手机,这已经不再是我的爱好了。
By my junior year of high school, I decided I needed a cellphone. Not only did nearly every one of my friends have one, they were no longer interested in having me use theirs to call my parents every time I needed to inform them of my whereabouts. With each text costing roughly 10 cents, and each call costing them precious minutes, I was dishing out dimes and quarters to nearly a half-dozen friends for months. Carrying a pocketful of change wherever I went, hoping I could borrow someone’s phone, was no longer my cup of tea.
因为我的父母并不支持女儿拥有手机,所以说服他们买一部手机将是一场艰苦的战斗。“其他人都有”是行不通的。我需要向父母提供一套强有力的证据支持论据。所以,我整理了一些。我围绕安全原因制定了一个论点。我最近拿到了驾照,我需要能够在紧急情况下给别人打电话。我粗略估计了每周开车的时间,以使这个论点更有说服力。接下来,我比较了设备和计划成本,并将其与过去六个月我分给朋友的钱数进行了比较。我最近开始建立网站来赚点外快,我知道我可以买得起一部基本的翻盖手机并支付每月的账单。
Because my parents weren’t proponents of their daughter having a cellphone, convincing them to get one was going to be an uphill battle. “Everyone else has one” was not going to cut it. My parents would need to be presented with a strong set of evidence-backed arguments. So, I put some together. I formulated an argument around owning a cellphone for safety reasons. Having recently obtained my driver’s license, I needed to be able to call someone in case of an emergency. I calculated a rough estimate of the number of hours per week I spent driving to give the argument a bit more weight. Next, I tallied device and plan costs and compared them to the amount of money I’d distributed to friends over the last six months. I’d recently started building websites to make a bit of money on the side and knew I could afford to buy a basic flip phone and pay the monthly bill.
听到我的辩解,我父母说他们不认为这是必需品。我出门时可以借用我妈妈的手机。在指出我每周要花三到四个小时开车带我和弟弟四处转悠后,他们决定这也许根本不是奢侈的。他们确信手机的便利性超过了它的成本。几天后,我得到了一部二手翻盖手机,上面有我自己的号码。
In response to my arguments, my parents said they didn’t think it was a necessity. I could borrow my mother’s phone when leaving the house. After pointing out that I was spending three to four hours per week driving both myself and my little brother around, they decided that maybe it wasn’t a luxury after all. They were sufficiently convinced that the convenience of having a cellphone outweighed its cost. I got a hand-me-down flip phone with a number of my own a few days later.
如今,当我需要说服他人开始重构项目的好处时,这段经历对我大有裨益。我从同事那里听到最多的抱怨之一是,他们有强烈的重构愿望,但他们就是不知道如何说服别人让他们这么做。他们花时间确定问题出现的情况,找到证据和指标来描述问题,以便更好地理解问题,并精心制定解决问题的计划。他们确信这个问题需要解决,并对自己的解决方案欣喜若狂,但在向经理或技术主管提出自己的想法时,却遭到了怀疑。
Today, this experience serves me well when I have to convince others about the benefits of beginning a refactoring project. One of the complaints I hear most often from fellow engineers is that they have a strong desire to refactor something but they simply don’t know how to convince anyone to let them do it. They’ve spent the time identifying the circumstances under which the problem arose, found evidence and metrics to characterize the problem so that they might better understand it, and carefully crafted a plan for how to solve it. They’re certain that the problem needs to be solved and are ecstatic about their solution, but are met with skepticism when presenting their ideas to either their manager or tech lead.
本章将首先解释为什么您的经理可能不同意,并帮助您了解他们的观点,以便您可以提出令人信服的论点。接下来,我们将介绍您可以采取的几种不同方法来获得管理团队的支持,以及您可以使用的一些具体策略来让他们支持您。最后,我们将研究支持可以采取的一些形式,以及这些形式如何影响您的执行计划和您最终组建的团队。
This chapter will kick things off by explaining why your manager might not be on board, and help you understand their perspective so that you can craft a compelling argument. Next, we’ll cover a few different approaches you can take for garnering the support of your management team, with some specific strategies you can use to get them rallying behind you. Finally, we’ll look at some of the forms buy-in can take, and how these can affect both your execution plan and the team you end up putting together.
您的经理可能会因为几个不同的原因而犹豫(或完全反对)大规模重构。首先,他们通常远离代码,不太可能深入了解代码的痛点。其次,他们的评估基于其团队按时交付有效产品功能的能力。第三,与大规模重构相关的最坏情况结果通常比与新产品功能相关的最坏情况结果严重得多。最后,大规模重构通常需要与您直接团队之外的利益相关者进行更多协调。
Your manager might be hesitant (or outright opposed) to a large refactor for a few distinct reasons. First, they are typically well-removed from the code and unlikely to understand its pain points intimately. Second, they are evaluated on their team’s ability to ship effective product features on time. Third, the worst-case outcomes associated with a large refactor are generally much more serious than the worst-case outcomes associated with a new product feature. Finally, large-scale refactors typically require much more coordination with stakeholders outside of your immediate team.
大多数工程经理很少编写代码,也很少参与代码审查。事实上,直接被新公司聘为管理职位的人甚至可能永远看不到他们的团队正在处理的代码。由于您的经理并不熟悉您和您的团队在开发过程中经常遇到的问题,因此他们对您的提议持怀疑态度也就不足为奇了。想象一下,试图向晚宴客人解释为什么要更换家里所有摇摇晃晃的门把手;他们可能能够从逻辑上理解这种沮丧,但他们不知道这些门把手在日常生活中有多烦人。
Most engineering managers are rarely coding, and hardly partaking in code reviews. In fact, someone hired directly into a management position at a new company might never even see the code their team works on. Because your manager isn’t intimately familiar with the problems you and your team are frequently encountering during development, it shouldn’t be a surprise that they are skeptical of your proposal. Imagine trying to explain to a dinner guest why you want to replace all of the rickety door knobs in your home; they might be able to understand the frustration logically, but they don’t know the extent to which such door knobs are irritating on a daily basis.
也许你的经理明白重构旨在改善的困难,但他们不明白为什么现在应该解决这些问题。毕竟,如果这些问题不是新问题,那么公司一定已经很好地处理了它们(并且正在继续处理它们)。你的经理正在权衡构建新事物的潜在好处与解决一系列挥之不去的问题。
Maybe your manager understands the difficulties your refactor aims to improve, but they fail to see why these should be fixed now. After all, if these problems are not new, the company must have been handling them (and is continuing to handle them) just fine. Your manager is weighing the potential upside of building something new against fixing a set of lingering problems.
经理的评估标准往往是其团队按时完成任务和帮助实现业务目标的能力。这些目标往往包括开发有助于留住和吸引更多用户的功能,或开辟新的收入来源。由于经理有这些激励措施,他们更有可能优先考虑那些影响与投入比率高的工作,即投入相对较少但影响较大的工作。经理也更有可能设定更激进的截止日期,希望尽快将这些更改推送给客户。
Managers tend to be evaluated on their team’s ability to hit deadlines and help achieve business objectives. These tend to include things like building features that help retain and acquire more users, or unlocking new revenue streams. Because managers have these incentives, they’re more likely to prioritize work that has a high impact-to-effort ratio—that is, work that is relatively low-effort but offers a high impact. Managers are also more likely to set more aggressive deadlines in hopes of getting these changes out to customers sooner.
这些目标有时与团队中工程师的目标不一致。工程师倾向于寻找解决有趣问题的项目,并且通常优先构建更强大的解决方案,而不是快速交付的解决方案。(并非所有工程师都适合这种模式,但根据我的经验,这总结了其中相当一部分。)大规模重构,虽然可能是你和你的队友有意义的、值得的努力,但却排在你经理的潜在项目列表的最后。大规模重构通常很冗长,而且由于它们故意对用户不可见,因此对业务几乎没有直接的积极影响。如果你的经理正在寻求晋升(或者他们担心即将到来的评估),他们可能不会那么热衷于支持你的计划。
These goals are sometimes at odds with those of the engineers on the team. Engineers tend to seek out projects that solve interesting problems and often prioritize building a more robust solution over one that’s quick to ship. (Not all engineers fit into this mold, but in my experience, this sums up quite a few of them.) A large refactor, while perhaps a meaningful, worthwhile endeavor for you and your teammates, is at the bottom of your manager’s potential projects list. At-scale refactors are usually lengthy, and because they are deliberately invisible to users, they result in little to no immediate positive impact on the business. If your manager is looking to move up the ladder (or if they’re concerned about their upcoming review), they will probably be less than eager to support your plan.
即使你的经理确信重构是值得的,他们也可能因为允许你这样做而冒着失去良好声誉的风险。正如你的经理会根据你的团队按时构建和交付的能力进行评估一样,他们自己的经理也会根据他们的组织对业务的影响进行评估。你的经理可能很难说服他们自己的经理,重构是一项有价值的工程时间和资源投资。
Even if your manager is convinced that the refactor is worthwhile, they might be risking good standing by giving you the go-ahead. Just as your manager is evaluated on your team’s ability to build and deliver on time, their own manager is equally evaluated on the impact that their organization can have on the business. It can be difficult for your manager to convince their own manager that a refactor is a valuable investment of engineering time and resources.
功能开发出错的方式有很多种。您的团队可能会遇到一些障碍,交付时间可能比最初预期的晚一些,或者功能交付给用户后,却发现了大量令人讨厌的错误。然而,新功能开发过程中发生灾难性中断的可能性相对较低,因为新功能的范围往往相对较好,边界也相对明确。
There are a handful of ways feature development can go awry. Your team might run into a few roadblocks and ship a bit later than initially anticipated, or maybe the feature makes it into the hands of users, only to reveal an abundance of pesky bugs. However, the likelihood of a catastrophic outage during the development of a new feature is relatively low because new features tend to be well-scoped, with well-defined boundaries.
执行大规模重构时,风险要大得多。团队冒着在大范围内引入回归的风险,发生灾难性中断的可能性也不容忽视。在解开陈旧、复杂的代码时,您的团队发现意外错误的可能性要大得多;为了修复错误而陷入困境的风险可能会大大延迟您的最后期限。我们在第 1 章中强调的每一个风险对您的经理来说都是显而易见的。
The stakes are much greater when executing on a large-scale refactor. The team risks introducing regressions across a large surface area, and the likelihood of a disastrous outage is not nearly as negligible. When untangling old, crufty code, there’s a much greater chance your team will unearth unexpected bugs; the risk of being pulled head first into a rabbit hole in an attempt to fix them can significantly delay your deadlines. Each and every one of the risks we highlighted in Chapter 1 is alarmingly obvious to your manager.
大多数公司都会围绕其产品(或多个产品)的各个部分组织工程团队。假设您为一个名为 RadTunes 的音乐流媒体应用程序工作。RadTunes 可能有一个团队负责创建播放列表,另一个团队负责管理搜索。当一个团队着手构建新功能时,它通常会在其拥有的代码库区域内进行操作。如果搜索团队构建一项新功能允许用户创建协作播放列表,那将令人惊讶;更明显的选择是播放列表团队这样做。
Most companies organize engineering teams around individual portions of their product (or products). Say you work for a music streaming application called RadTunes. RadTunes might have a team responsible for playlist creation, and another for managing search. When a team sets out to build a new feature, it typically is operating within an area of the codebase that it owns. It’d be surprising to see the Search team build a new feature allowing users to create collaborative playlists; the more obvious choice would be for the Playlist team to do so.
现在想象一下,您是播放列表团队的一员,团队正在努力解决歌曲对象模型问题。您已经想出了一个改进计划,但它涉及修改公司中几乎每个团队经常使用的代码。您和您的经理需要在开始时与这些团队中的每一个进行协调以征求支持,并在整个重构过程中继续协调以确保每个人都能正确协调。当您向经理介绍您的重构时,他们会看到在整个项目期间保持每个人井然有序所需的巨大工作量。他们可能不愿意支持它,这是很正常的。
Now imagine that you are on the Playlist team and the team is struggling with the song object model. You’ve come up with a plan for improving it, but it involves modifying code nearly every one of the teams at the company works with regularly. You and your manager will need to coordinate with every one of these teams at the onset to solicit support, and continue to coordinate throughout the refactor to make sure everyone is properly aligned. When you pitch your refactor to your manager, they are seeing the colossal amount of work required to keep everyone organized for the complete duration of the project. It’s only normal that they might be hesitant to support it.
现在我们知道了为什么我们的经理可能不同意,我们可以集中讨论一些有用的策略来缓解他们的恐惧并构建一个强有力的案例来说服他们重构是值得的。本节假设您已经与您的经理就您的项目进行了初步的调查谈话。如果你还没有进行过那次谈话,“初步谈话”是一个很好的起点。这次谈话很重要,原因有二。首先,它可以帮助你了解哪些因素对你的经理影响最大。其次,它让你了解你的经理是否更容易被情感或逻辑论点说服。这次谈话将为你提供初步的背景信息,让你选择最有效的策略来说服你的经理。
Now that we understand why our manager might not be on board, we can focus on a few helpful strategies for assuaging their fears and constructing a robust case to convince them that the refactor is worthwhile. This section assumes you’ve already had an initial investigatory conversation with your manager about your project. If you haven’t had that conversation yet, “Initial Conversation” is a good starting point. This conversation is important for two reasons. First, it helps you understand which factors are weighing most heavily on your manager. Second, it gives you a sense of whether your manager might be more readily convinced by an emotional or logical argument. This conversation will give you the preliminary context you need to choose the most effective strategies to convince your manager.
与经理进行初次谈话后,你就可以集中精力考虑可能需要使用的说服技巧。我们将在此概述四种简单而独特的技巧,但请注意,这并不是一份详尽的清单。不同的策略对不同的经理最有效,具体取决于什么最能激励他们(例如,他们在公司的成长轨迹)或他们反对重构的程度(例如,他们大体上同意问题的存在,但不相信应该立即修复)。最终,促使经理批准的最有效方法是结合使用多种技巧:选择你认为影响最大且最容易使用的技巧。如果你有信心并且准备充分,你可能会得到你一直 寻求的“是”的答复。
Once you’ve had that initial conversation with your manager, you can zero in on the persuasion techniques you might want to use. We’ll outline four simple, distinct techniques here, but know that this is not an exhaustive list. Different strategies will work best with different managers, depending on what motivates them the most (e.g., their growth trajectory at the company), or the degree to which they are opposed to the refactor (e.g., they are generally in agreement that the problem exists but are unconvinced it should be fixed imminently). Ultimately, the most effective way to nudge your manager into giving the go-ahead is to use a combination of techniques, opting for those you believe will have the most impact and that you are most comfortable using. If you are confident and well-prepared, you might just get the “yes” you’ve been seeking.
我们的一些同事可以走进一个满是固执的工程师的会议,在半小时内就说服了所有人。不幸的是,我不是那种人。如果你也不是,不用担心!我们可以使用一些简单(且诚实)的对话技巧,以更有说服力的方式表达自己。
Some of our colleagues can walk into a meeting full of stubborn engineers and within a half hour have everyone persuaded of their opinion. Unfortunately, I am not one of those people. If this isn’t you either, not to worry! There are a few easy (and honest) conversational tricks we can use to express ourselves in a more convincing manner.
我们中很少有人能免受奉承,包括你的经理。如果在谈话的任何时候,你和你的经理就某件事达成了一致,那就用赞美来强调它。例如,你和你的经理都同意重构是有益的,但你的经理希望在六个月后重新评估。你可以将焦点转移回重构的好处上,说:“你对重构的潜在好处提出了一些非常好的观点。很明显,你对我们遇到的问题有着细致入微的理解。”你的经理会想起他们发现的好处,并倾向于更重视这些好处与潜在的缺点。
Very few of us are immune to flattery, your manager included. If at any point during your conversation, you and your manager agree on something, highlight it with a compliment. For example, you and your manager agree that the refactor would be beneficial, but your manager would prefer reevaluating in six months. You can shift the focus back to the benefits of the refactor by saying, “You’ve made some really great points about the potential benefits of a refactor. It’s pretty clear you have a nuanced understanding of the problems we’ve been experiencing.” Your manager will be reminded of the benefits they identified and inclined to weigh them more heavily against the potential downsides.
您不仅应该为经理的任何反驳做好准备,甚至还可以考虑提出反驳意见。这听起来可能有点奇怪,但许多心理学研究表明,双方的论点比单方面的论点更有说服力。直接提出反驳意见有几个好处:
Not only should you be prepared for any counter-arguments from your manager, you might even consider bringing up the counter-arguments for them. It may sound a bit odd, but a number of psychological studies have shown that two-sided arguments are more convincing than one-sided arguments. There are a few benefits to presenting counter-arguments directly:
通过向您的经理证明您已经认真考虑过大规模重构的缺点,您进一步展示了您对这一努力的深思熟虑和彻底性。
By demonstrating to your manager that you’ve seriously considered the downsides of a large-scale refactor, you’re further demonstrating your thoughtfulness and thoroughness around the effort.
您正在重申经理的顾虑;虽然您可能不会直接称赞他们推理大规模重构缺点的能力,但您正在确认他们的担忧是合理的。如果您的经理觉得自己的想法得到了很好的理解,他们会更愿意听取您的想法。
You’re reaffirming your manager’s concerns; while you might not be outright complimenting them on their ability to reason about the drawbacks of a large-scale refactor, you are confirming that their apprehension is legitimate. Your manager will be more open to hearing about your ideas if they feel that their own ideas are well understood.
现在,利用反驳为自己辩护的技巧是谨慎反驳。让我们回顾一下 RadTunes 的例子,“经理需要协调”。您的经理计划让播放列表团队在接下来的一个季度中花费大部分时间来构建协作播放列表。您建议团队在开始开发新功能之前花费宝贵的时间来重写应用程序对歌曲的表示。
Now, the trick to using counter-arguments in your favor is to refute them carefully. Let’s refer back to our RadTunes example, “Managers Need to Coordinate”. Your manager is planning for the Playlist team to spend most of the upcoming quarter building collaborative playlists. You’re proposing that the team spend crucial time rewriting the application’s representation of a song before kicking off development on the new feature.
您可以告诉您的经理:“如果我们下个季度开始重构歌曲,我们将不得不将协作播放列表的工作推迟几个月。这肯定会让过去几年一直要求此功能的客户感到失望。”您可以立即通过跟进反驳来解决这个问题:“但是,我相信,如果我们重写我们的歌曲实现,我们将能够缩短几周的协作播放列表开发时间,并让搜索团队能够根据类型显示更好的结果。”
You could tell your manager, “If we began refactoring songs next quarter, we’d have to put off work on collaborative playlists for a few months. That would certainly be disappointing to our customers who have been requesting this feature for the past few years.” You can immediately address the issue by following up with a rebuttal: “However, I’m confident that if we rewrite our songs implementation, we’ll be able to shave several weeks off of collaborative playlist development and unblock the Search team on surfacing better results by genre.”
您甚至可以提出经理尚未提出的反驳,或者您怀疑他们根本不会提出的反驳。这听起来适得其反,但如果您成功驳倒反驳,它将提高您的可信度并加强您的立场。
You can even introduce a counter-argument your manager hasn’t brought up yet or one you doubt they’ll bring up at all. This sounds counter-productive, but it will boost your trustworthiness and strengthen your stance, assuming you successfully knock down the counter-argument.
虽然这不是程序员书架上常见的书,但我强烈建议你买一本戴尔·卡耐基的《如何赢得朋友和影响他人》。这本书出版于 80 多年前,但其中的大部分内容至今仍然适用。它所教授的技能不仅在你试图确保项目成功时对你有帮助,而且在你生活的各个方面也会对你有帮助!
While this is not a book you’d typically find on a programmer’s shelf, I highly recommend grabbing a copy of Dale Carnegie’s How to Win Friends and Influence People. It was published over 80 years ago, but most of its lessons continue to hold true today. The skills it teaches will be helpful to you not only when trying to secure buy-in for your projects, but in all aspects of your life!
如果您对玩弄办公室政治不感兴趣,那完全没问题,您可以直接跳到“指标”一节。另一方面,如果您有兴趣利用组织环境为自己谋利,那么您可以使用多种手段有效地迫使您的经理批准大规模重构。您可以构建一个对齐三明治,确保您的队友和高层管理人员的支持,将您的经理夹在两者之间。
If you are uninterested in playing office politics to your advantage, that’s perfectly all right, and you are welcome to skip ahead to “Metrics”. On the other hand, if you are interested in leveraging the organizational landscape to your benefit, there are a number of levers you can pull to effectively compel your manager into giving the go-ahead on a large-scale refactor. You can build an alignment sandwich, securing the support of your teammates along with the support of upper management, sandwiching your manager between the two.
这种方法只有在你得到双方的充分支持时才有效。如果你的经理只感受到来自你团队的压力,那么他们仍然会坚定地拒绝重构,因为他们知道上级不会对他们进行太多批评(如果有的话)。如果你的经理只感受到来自上级的压力,而你的团队没有口头支持(或者更糟的是,你的团队口头反对),他们不太可能继续推进该项目,因为他们知道这样做可能会损害团队士气。
This approach only works if you have ample support from both sides of the sandwich. If your manager only feels pressure from your team, then they’ll still be on solid footing to turn down the refactor, knowing there’ll be little flak (if any) from their superiors. If your manager only feels pressure from above, and your team is not vocally supportive (or worse, your team is vocally opposed), they’re unlikely to move forward with the project knowing they risk harming team morale.
请注意,这种策略可能会适得其反。根据您之前的谈话,您的经理知道您对进行此重构感兴趣。如果公司高层管理人员或其他有影响力的个人与他们联系,希望他们推进重构,他们很可能会将两者结合起来,推断出您一直在寻求外部影响。如果您与经理的关系不稳定,这可能会引起一些反弹。无论您与经理的关系有多强,请尝试坦率地告诉他们您已经寻求了外部意见;然后,不要让这些盟友直接联系您的经理,而是考虑安排你们三人开会讨论你们的观点。
Be mindful that this strategy can backfire. Given your previous conversation, your manager is aware that you’re interested in pursuing this refactor. If they are approached by upper management or other influential individuals at your company about moving forward with the refactor, there’s a chance they’ll put two and two together and deduce that you’ve been seeking external influence. If you have a tenuous relationship with your manager, this could lead to some backlash. Regardless of the strength of your relationship with your manager, try being upfront with them about having sought out external opinions; then, instead of having these allies reach out to your manager directly, consider setting up a meeting with the three of you to discuss your perspectives.
在与高层管理人员讨论重构之前,您应该花时间让您的队友达成共识。您可能已经在之前的调查阶段(收集指标以描述问题、起草执行计划)与一些队友讨论过重构的各个方面,以收集他们的反馈。对于尚未了解您的思维过程的队友,请花点时间向他们说明情况。这不必是正式的;给他们发消息或约他们喝杯咖啡即可。
Before reaching out to upper management about your refactor, you should take the time to get your teammates on the same page. Chances are, you’ve discussed aspects of the refactor with some of your teammates throughout prior investigatory stages (collecting metrics to characterize the problem, drafting an execution plan) to gather their feedback. For the teammates who haven’t yet gotten a glimpse of your thought process, take some time to fill them in. This doesn’t have to be anything formal; shoot them a message or ask to grab a coffee.
Your ultimate goal is to get them to vouch for the refactor either in a public setting where your manager is present (in a meeting, in a public chat, in an email), or in their own one-on-one with your manager. You may want to coordinate with your teammates so that not all of them bring it up in their one-on-ones the same week; the trick is to make everyone’s interest appear organic, not prepared. Once you’ve secured sufficient backing from your teammates, you’ll have built up the bottom slice of your sandwich.
If your manager isn’t interested in pursuing a large-scale refactor, perhaps your manager’s manager (referred to as skip-level) will be. Upper-level management tends to have an expansive view of the organization’s objectives as well as its current and future projects. Given this broader perspective, your skip-level might be more sympathetic than your manager to a refactor spanning a large surface area because they are better able to visualize the scope of its benefits.
Some companies have strict hierarchies where going directly to your skip-level is seen as a huge faux pas. Be mindful of how a conversation with your manager’s manager might be perceived before booking time with them. At the very least, be careful not to put down your manager during your meeting; focus on building interest and alignment in your refactor instead.
If you have a preexisting relationship with your skip-level, and you have reason to believe that they would be supportive of your effort, schedule a meeting with them. Your initial conversation should be similar to the one you had with your manager (see “Initial Conversation”). This exchange should help you discern whether your skip-level is likely to advocate for your proposed refactor. If you determine that they aren’t a strong supporter, then you’ll want to seek the support of other influential individuals at your company to act as the top slice of bread in your alignment sandwich. If they appear supportive, however, schedule a second meeting. You can discuss the details of your execution plan, align on the resources you’ll need, and determine how they can help you get the approval of your manager.
Having a strong relationship with your skip-level can be quite beneficial regardless of refactoring aspirations. In fact, I highly recommend holding quarterly (or even monthly) one-on-ones with your skip-level if at all possible. Upper management can be a valuable resource if you’re looking to expand your reach as an engineer; if you need to grow your skills by leading an effective project in your part of the organization, they’ll be able to identify the right project for you. If you’re seeking mentorship, they can connect you with other senior engineers at the company. Having an established relationship with your skip-level can also help you navigate difficulties in your relationship with your direct manager, if they ever arise.
Within every company, there are typically a handful of departments that have considerable authority over the business. When their input is required, their decision is the final say, whether that’s a decision on how a new feature should be designed, a new process should operate, or a bug should be resolved. In many industries (the financial services industry, healthcare, human resources), this is the legal and compliance department. If you’ve been at your current company for several months, you likely have an inkling of which department that might be. If you’re not quite certain, ask your peers; they might have a story or two about the security department’s involvement with an incident or the sales team’s input on a new feature.
In some cases (not all), these departments might have a vested interest in your refactor. Take, for instance, the compliance team at Smart DNA, our biotechnology company from “At Work”. Above all else, the team is responsible for ensuring that the sequenced DNA of its customers remains safe at all times. Having most of the company’s systems using an outdated version of Python would likely be an area of concern for them, given that security patches can no longer be applied. If the research team at Smart DNA had not been in support of updating their Python dependencies, the software team could have reached out to the company’s compliance team and enumerated the many ways running an unsupported version of Python is a vulnerability. The compliance team would then put pressure on the necessary engineering managers to prioritize the migration, giving the software team its top slice of bread for a completed sandwich.
Every company has a handful (or two) of highly influential engineers. These engineers are extremely senior members of your technical staff (think principal and distinguished engineers), have been at the company for a significant length of time, or, in some cases, both. Many of them, if not most, are still knee-deep in the code. If they’re familiar with the surface area you want to improve, not only will they immediately understand the problems your refactor addresses, but they’ll also have valuable insights to contribute to your plan to date. Securing their support can be crucial in legitimizing your effort to your manager. At some companies, there is no greater stamp of approval than that of a senior engineer. If you can garner their thumbs-up, your alignment sandwich will have a sturdy top slice.
If you can rally the support of multiple upper-level influences (your skip-level, critical business departments, highly influential engineers), that’s even better! Your alignment sandwich doesn’t need to be perfectly balanced; leaning a little bit top-heavy only makes the approach more powerful.
If your manager is partial to logical arguments, then you should use the evidence you gathered in Chapter 3 to bolster your position. Set up some time with your manager to continue your initial conversation. Tell them that you’ve given the refactor more thought, and that you’ve spent time characterizing the problem so they might better appreciate its value (and, hopefully, its urgency).
Ahead of your meeting, prepare your evidence. If you’ve gathered an abundance of evidence, focus on the two or three most startling pieces. Some metrics are better communicated in visual form, so consider putting together a graph or two to better illustrate the points you want to emphasize. Taking the time to synthesize this information into a medium that’s easy for your manager to consume is beneficial for a few reasons. First, it’ll give you a comprehensive document you can circulate to other interested individuals at the company. This can be useful when garnering support cross-functionally or recruiting teammates, which we’ll cover in Chapter 6. Second, you’ll have something you can reference during your meeting. For those of us who aren’t confident in our ability to persuade others, having a clear set of talking points to reference throughout the discussion can make all the difference.
For engineers who are a bit more timid and haven’t yet built the social capital necessary to lean on influential colleagues or play hardball with their manager, I recommend relying heavily on the metrics argument. Facts are easy to prepare, easy to memorize, and usually difficult to refute.
If you are exceedingly confident that your refactor is critical to the business and your manager is unwilling to budge, there are a few more severe options you can consider. Presenting these severe options is often referred to as playing hardball. A word of caution: either of these approaches can seriously jeopardize your relationship with your manager and fellow colleagues. When successful, however, they can be really effective, and, if your refactor proves successful (which it most definitely will be, given you are reading this book), can catapult your career forward.
It’s important to note that not everyone is in a strong enough position (either in their role at their current company or financially) to play hardball with their manager, and that’s okay! You need to have built up quite a bit of clout and established a long history of good performance in your current role to be able to pull this off.
One final note before we dive in: with both of these tactics, you must be willing to follow through. If your manager calls your bluff and remains unconvinced, not only does it risk eroding your relationship, it diminishes your ability to take a similar approach successfully when another important project comes along.
When there is a serious need to refactor something at a large scale, it usually indicates that there is an amount of nontrivial work going on behind the scenes to keep things operational. Management is typically unaware of this work, or, if they are, they do not recognize its importance. If you are actively, regularly finding ways to mitigate the problem your refactor aims to solve, you can warn your manager that you no longer plan to do this work. The idea is to stop doing any invisible work that is preventing management at your company from seeing the problem your refactor aims to solve.
Take our Python migration at Smart DNA, for instance. Before Python 2.7 was rolled out to all environments, whenever a security patch was made available, your team needed to spend valuable time porting the patch to the outdated Python 2.6 systems. Because security patches cannot be anticipated, any time a new vulnerability was discovered, your team had to pause all feature work and divert its energy to porting the patch. This kind of maintenance work was extremely time-consuming and high-risk, but necessary under the circumstances. Unfortunately, management was unwilling to recognize this operational cost of running outdated software.
In this scenario, you could put pressure on your manager to prioritize the Python upgrade by suggesting that the team would no longer port any new security patches as they become available. Tell your manager that you are trying to set appropriate boundaries for the team; given that your team is primarily focused on feature development, you can assert that supporting legacy software run by the research team is not strictly a responsibility. If, during your quarterly or yearly planning process, your manager does not properly account for the work involved with porting a new patch on a regular basis, make a point to highlight that.
Yes, you are drawing a hard line. You might even feel guilty for no longer doing important maintenance work (which most developers believe is a critical part of their job). That’s completely normal. I’ve made this argument before and worried about being irresponsible, letting the company down. What I came to realize was that by holding my ground, I was doing just the opposite; I was showing the business where it had an important blind spot, and the significance of that blind spot. By redefining expectations with your manager, you are shedding light on the pervasiveness of the work involved to keep your systems operational without a substantial refactor.
If all else fails, you can suggest to your manager that if they continue to oppose the refactor, you’ll either transfer to another team or outright quit the company. If you want to stay at the same company and are able to switch teams, identify a team you are interested in joining before bringing it up with your manager; better yet, try to find a manager who is supportive of the refactor elsewhere at the company and is interested in having you join their team. If switching teams isn’t on the table, you might threaten to quit. You should thoughtfully consider this decision, and seriously examine whether you have the necessary financial stability to do so before speaking to your manager.
This is not an easy conversation to have with your manager. First, bring up that you’re concerned that the company isn’t taking the problems you’ve identified more seriously. If your manager is eager to keep you on their team, they might reassess and allow the refactor to move forward.
Even though securing buy-in happens well before a single line of code is written, it can be one of the most difficult aspects of refactoring at scale. Managers can be apprehensive about kicking off a lengthy, engineering-focused endeavor with good reason; they have their own sets of constraints and incentives within an engineering organization. That said, each of us has the ability to learn and master techniques to convince them that the effort is worthwhile despite any misgivings. We can discover how to lean effectively on our teammates and colleagues in the broader organization to give us the additional support we need.
After the dust has settled, depending on the degree of buy-in you’ve obtained, you may or may not be able to execute on your refactor. If your manager remains skeptical, consider shelving the project for now. You can continue to accrue supporting evidence, waiting for a more opportune moment to reintroduce the subject. For instance, if your company suffers from an incident caused by a problem your refactor seeks to solve, this might be a good time to revive the conversation with your manager. The next time your team enters its long-term planning phase, consider proposing the refactor once more. Keep a watchful eye and an ear to the ground for any opportunities to shed new light on your refactor.
If you’ve acquired buy-in, whether that’s an enthusiastic yes or a lukewarm nod, you’ll need to leverage that support to garner resources for your project. You’ll need to determine which engineers are required to give the refactor its greatest chance of success, and at which stages their expertise will be needed. We’ll discuss everything you need to know to make these decisions in Chapter 6.
Ocean’s 11 is one of those heist films that shows up on everyone’s list of favorites. It starts off with Danny Ocean getting released from prison. He meets up with his partner in crime and friend Rusty Ryan to propose a heist. The plan is to steal $150,000,000 from three Las Vegas casinos: the Bellagio, the Mirage, and the MGM Grand. The two thieves know they can’t pull off the heist alone, so they start gathering a crew of criminals, including a former casino owner, a pickpocket, a con man, an electronics and surveillance expert, an explosives professional, and an acrobat.
The team splits up into two groups: the first group gets to know the ins and outs of the Bellagio, learning the routines of the staff and gathering details on how the casino operates; the second group builds a replica of the casino vault to practice maneuvering past its challenging security system. Within a few days, the group hatches a plan. High jinks ensue, hurdles are dodged, and (spoiler alert!) the team eventually escapes with the cash.
Ocean and Ryan could never have robbed the Bellagio alone. Not only would they have needed months to gather the financial resources required to prepare for the heist, it’s unlikely that the two of them alone could have concocted a reasonable plan to bypass the casino’s defensive measures. By assembling a team of just the right size with just the right skills, they cut down on their execution time and increased their chances of success.
To execute on a large refactoring effort successfully, we need our own Ocean’s 11. Danny spent months iterating on his heist while locked up in New Jersey; from his blueprint, he derived a list of skills and expertise he needed, along with the names of potential candidates with these abilities. In this chapter, we’ll learn how to assemble different kinds of teams, depending on the kind of expertise we require to execute on our refactoring effort most effectively. As technical leads, we’ll learn how to narrow our list of potential teammates and convince them to join us on our journey. Finally, we’ll discuss how to make the best of an unfortunate situation: needing to execute on the project alone.
In Chapter 4, we learned how to draft an effective plan of action. We learned how to capture and synthesize the important complexity of our refactoring effort in a few concise, top-level milestones with a handful of critical subtasks.
Because most of us work on a team with a few other engineers, there’s a strong likelihood that our plan was derived cooperatively and we intend to execute it as a team. When executing a large-scale refactor, however, we almost always need some help from colleagues on different teams across the company. On the other hand, there are times when we scope out and plan a refactoring effort either alone or with just one or two other engineers. In either case, we can use our plan to figure out precisely which engineers we’ll need and when.
We can start by rereading our plan. As we go through each step, we try to visualize the code we’ll need to interact with. Can we conjure it up easily? Can we confidently identify the changes we need to make and reason through the potential impact or downstream effects of those changes? Do we understand the pitfalls we might run into in the given area of the codebase? Do we understand the potential product implications of the changes we want to make? Are we deeply familiar with the technologies we’ll either be directly or indirectly interfacing with? If so, great! We’re probably in a good position to make those changes ourselves. If not, then we’ll need someone else’s help. We can enlist someone for help in one of two ways, either as an active contributor or as a subject matter expert.
An active contributor is heavily involved with the project, ideally from day one. They are actively contributing to the effort by writing code alongside you. Active contributors should be consulted for input on the execution plan early and through each of its revisions.
Subject matter experts, or SMEs for short, are not active contributors to your effort. They’ve agreed to be available to talk through solutions with you, answer questions, and maybe do some code review. While their contributions can be very meaningful, their time commitment to the project is minimal. Their primary focus remains on other projects distinct from yours.
Let’s make this a bit more concrete by working through an example project. Your company’s monitoring and observability team is migrating from one metrics-collection system to another (maybe StatsD to Prometheus). They’ve built up the infrastructure, provisioned some nodes, and are now ready to start accepting traffic from your application. The team needs one or two developers who are intimately familiar with how the application uses StatsD to help with the transition. Being one of those people, you’ve decided to lend a hand by writing a new internal library to interface with the new solution and ultimately replace the current library. You’ll need to ensure that the Prometheus library offers feature parity with the current one and a clean, intuitive API. Your final task will be to establish best practices for using the new library and encourage its adoption across the engineering organization.
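To picture what such a library might look like, here is a minimal sketch, assuming a thin facade that application code calls and that forwards each metric to one or more backends. Every name here (`MetricsBackend`, `InMemoryBackend`, `Metrics`, the metric names) is hypothetical and not taken from StatsD, Prometheus, or any real client library; a real backend would wrap the actual client. Writing to both the old and new backends at once is one possible way to check feature parity during the transition, not a prescribed approach.

```python
from collections import defaultdict
from typing import Dict, Optional, Protocol

Tags = Optional[Dict[str, str]]


class MetricsBackend(Protocol):
    """Operations every backend must support (hypothetical interface)."""

    def increment(self, name: str, value: int = 1, tags: Tags = None) -> None: ...

    def timing(self, name: str, millis: float, tags: Tags = None) -> None: ...


class InMemoryBackend:
    """Stand-in backend for illustration; a real one would wrap a metrics client."""

    def __init__(self) -> None:
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    @staticmethod
    def _key(name: str, tags: Tags):
        # Sort tags so the same metric always maps to the same key.
        return (name, tuple(sorted((tags or {}).items())))

    def increment(self, name: str, value: int = 1, tags: Tags = None) -> None:
        self.counters[self._key(name, tags)] += value

    def timing(self, name: str, millis: float, tags: Tags = None) -> None:
        self.timings[self._key(name, tags)].append(millis)


class Metrics:
    """Facade used by application code; forwards each call to every backend."""

    def __init__(self, *backends: MetricsBackend) -> None:
        self._backends = backends

    def increment(self, name: str, value: int = 1, tags: Tags = None) -> None:
        for backend in self._backends:
            backend.increment(name, value, tags)

    def timing(self, name: str, millis: float, tags: Tags = None) -> None:
        for backend in self._backends:
            backend.timing(name, millis, tags)


# During the migration, both systems receive the same calls.
legacy, replacement = InMemoryBackend(), InMemoryBackend()
metrics = Metrics(legacy, replacement)
metrics.increment("orders.fulfilled", tags={"region": "us"})
metrics.timing("orders.latency", 12.5)
```

Because call sites depend only on the facade, the old backend could later be dropped from the constructor without touching application code.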
You don’t need to have intimate knowledge of how the new metrics-collection system works to do your job proficiently. You can lean on the monitoring team when needed and it can lean back on you if it notices something odd about the integration process with your application. In this example, you’re an active contributor collaborating with the monitoring team.
While auditing uses of the StatsD library, you notice that another product development team is using it in a way that is distinct from most other teams. You want to understand why the team is using the library in this way, and whether this behavior absolutely needs to be replicated in the new system. If this behavior is necessary, you have to make sure that Prometheus can accommodate it. You reach out to a few folks on the team to see whether they might have time to answer your questions. One team member, let’s call them Frankie, eagerly agrees to meet with you. After a quick chat, you come to the conclusion that the behavior should be supported in the new Prometheus library, and Frankie has agreed to review your code as you build out the functionality. Frankie, in this scenario, is an SME.
You might need a number of types of expertise to execute your refactoring effort successfully. With our metrics-collection example, we needed the monitoring team’s technology expertise with StatsD and Prometheus, Frankie’s product expertise with a specific set of use cases, and our own expertise with how the codebase uses the metrics-gathering libraries at large. We might even want to consult with someone from the security team to confirm that no sensitive customer data ends up flowing through the new system (and if it does, we have measures in place to contain it swiftly).
When enumerating each of the kinds of expertise you’ll likely need, keep an eye out for a range. Refactoring at scale typically affects a large surface area, so it shouldn’t be surprising if you end up with a lengthy list. Don’t worry, we’ll learn just how to narrow that list down next.
We’ve now successfully drafted a list of types of expertise we want available to us while we execute our refactoring effort; for our metrics-collection refactor, we need a technology expert, a product expert, and, finally, a security expert. Alongside each of the kinds of expertise, we denote the major project milestone at which point that expertise will be needed; if the expertise is needed throughout multiple milestones, simply note the earliest milestone when that help will be needed. Our final step before beginning to brainstorm potential experts is to label whether we think we’ll need an SME or an active contributor for each expertise. We can pencil this in for now, because the role we anticipate the expert to have might change as we meet with potential candidates and work out their involvement with the project.
Finally, we have to match each expertise with one or more people. Start from the beginning of the list and, for each item, write the first few names of either individuals or teams that come to mind.
If you work at a large company or haven’t gotten to know folks across different engineering teams, you may have a difficult time coming up with experts for each expertise. That’s okay! You can start off by identifying a department. If you have access to an updated organization chart, use it to try to locate the best team within the department you identified. Do not be afraid to leverage your manager to help you generate and subsequently reduce the list of experts. Part of their job is to make sure that the team has all the resources it needs to execute projects efficiently, and they likely have much better insight about which teams across the organization are well suited to help out.
If you don’t have access to an updated organization chart but your engineering team has on-call rotations and uses a service like PagerDuty to alert engineers about incidents, you might be able to find the right experts by referencing these rotations. Look for the feature or infrastructural component for which you’re seeking an expert and find the team with the corresponding on-call rotation. Voilà!
Continue to jot down names until you’ve run out of items. Table 6-1 shows an example list we came up with for the metrics-gathering migration.
| Area of expertise | Milestone | Role | Experts |
|---|---|---|---|
| Understand how the order fulfillment code uses StatsD (distinct from most other product features) | 1 | SME | Frankie, Mackenzie, Order Processing Team |
| Automated end-to-end testing between library and Prometheus | 2 | Active contributor | Jesse, Automated Testing Team |
| Monitoring application traffic to Prometheus as teams begin to adopt it | 3 | SME | Monitoring Team |
| How our application deployment pipeline will affect Prometheus nodes | 1 | SME | Jesse, Release & Deploy Team |
| Security implications of gathering metrics about customers; security-conscious customers we should be particularly careful about monitoring | 1 | SME | Product Security Team |
Next, highlight any names that pop up more than once. There isn’t much overlap in our example set, but we notice that Jesse might be a good candidate for two of the five items. Your company may have a number of senior engineers with a wide breadth of expertise that could be helpful to your refactor. Conferring with someone who happens to be an expert on multiple relevant topics can be helpful on many fronts.
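For a longer list, spotting repeated names by eye gets tedious. As a small sketch, the rows of Table 6-1 can be written down as records and the overlap counted mechanically. The `ExpertiseNeed` record and `overlapping_experts` helper are hypothetical names introduced only for this illustration, and the area descriptions are abbreviated from the table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ExpertiseNeed:
    area: str                 # what we need help with
    milestone: int            # earliest milestone at which help is needed
    role: str                 # "SME" or "active contributor"
    experts: Tuple[str, ...]  # candidate individuals or teams


NEEDS: List[ExpertiseNeed] = [
    ExpertiseNeed("Order fulfillment use of StatsD", 1, "SME",
                  ("Frankie", "Mackenzie", "Order Processing Team")),
    ExpertiseNeed("End-to-end testing between library and Prometheus", 2,
                  "active contributor", ("Jesse", "Automated Testing Team")),
    ExpertiseNeed("Monitoring application traffic to Prometheus", 3, "SME",
                  ("Monitoring Team",)),
    ExpertiseNeed("Deployment pipeline impact on Prometheus nodes", 1, "SME",
                  ("Jesse", "Release & Deploy Team")),
    ExpertiseNeed("Security implications of customer metrics", 1, "SME",
                  ("Product Security Team",)),
]


def overlapping_experts(needs: List[ExpertiseNeed]) -> Dict[str, int]:
    """Return each name listed for more than one area, with its count."""
    counts = Counter(name for need in needs for name in set(need.experts))
    return {name: n for name, n in counts.items() if n > 1}


print(overlapping_experts(NEEDS))
```

With the rows above, the only name returned is Jesse, matching what we noticed by reading the table.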
First, it can help us decrease the total number of people we’ll need to coordinate with to complete our project. Coordinating a large project with a single team can be difficult, never mind coordinating a large project involving multiple developers across a number of teams. Each contributor not only has to be pitched on the effort and brought up to speed, but they must also adapt to your team’s development process (e.g., weekly or daily stand-ups, monthly retrospectives). It can take a considerable amount of time and effort before everyone is well-aligned and operating at a good pace.
Second, experts who happen to have a deep understanding of multiple important aspects of the project likely have a strong perspective on how these pieces work together. This can be valuable insight shared by few other engineers at the company. Given our sample list of experts in Table 6-1, Jesse is likely one of those individuals. From our conversations with them, we know that they’ve worked closely with the release & deploy team over several months to help it build a percentage-based release system for two important services at the company. We also know that after that project Jesse moved to the internal tools team, where they worked to improve the availability of automated testing environments. Jesse is just one of those engineers who’s been at the company for a while, worked on a laundry list of projects, and has keen insight into how each of these pieces works together.
Unfortunately, people like Jesse can be quite busy (probably because they’re providing input on a number of projects as an SME, in addition to leading a few of their own). If they are not available to help in a regular capacity but you believe their unique knowledge is critical to the refactoring effort, offer to have them review your execution plan. I’ve found their input particularly helpful in verifying my least-confident time estimates. If you’re looking for an expert to be actively involved in your project, they’ll be able to suggest another expert or two to replace them.
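If your list of experts lives in a spreadsheet or a script, the overlap check can be automated. The snippet below is a minimal sketch that uses only the four rows of Table 6-1 shown in this section; the split between named individuals and teams, and the field names `people` and `team`, are assumptions made for illustration.

```python
from collections import Counter

# The four rows of Table 6-1 visible in this section. Splitting each entry
# into named individuals ("people") and a team is an illustrative assumption.
rows = [
    {"expertise": "Automated end-to-end testing between library and Prometheus",
     "people": ["Jesse"], "team": "Automated Testing Team"},
    {"expertise": "Monitoring application traffic to Prometheus",
     "people": [], "team": "Monitoring Team"},
    {"expertise": "Effect of the deployment pipeline on Prometheus nodes",
     "people": ["Jesse"], "team": "Release & Deploy Team"},
    {"expertise": "Security implications of gathering customer metrics",
     "people": [], "team": "Product Security Team"},
]


def repeated_experts(rows):
    """Return, sorted, the names listed for more than one expertise."""
    counts = Counter(name for row in rows for name in row["people"])
    return sorted(name for name, n in counts.items() if n > 1)


print(repeated_experts(rows))
```

A name that shows up here is only a starting point for a conversation; the script says nothing about whether that person has time to help.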
If very few names (or none at all) overlap and your list of required types of expertise is quite lengthy, not to worry! You can still successfully execute a large-scale refactor with just a handful of resourceful individuals.
For me, a good rule of thumb is to limit the number of active contributors to the size of team you’ve been most comfortable working with in the past. If you’ve been on successful product engineering teams of six, then limit your team to six active contributors. Everyone’s experience working with teams at different companies is a little different; you know yourself and your preferred working conditions best, so go with what you know to be most effective. Large refactoring projects are complex enough from both process and technical standpoints; don’t let your team be yet another potential curveball.
If your list of active contributors feels too long, review it and see whether there is any expertise for which you can instead seek out the help of an SME. Working with SMEs comes at a much lower coordination cost because they are only consulted on an ad hoc basis. We’ll cover some strategies for effectively communicating with SMEs in Chapter 7.
If we happen to know someone who could be a valuable expert on one or more of the items on our list, we might ask them for their help directly. Chances are, they’ll be more than happy to help out. After all, asking someone you know is probably the most convenient option. If you’ve worked together previously, you’ll be able to establish a cadence that works well for both of you pretty quickly and begin making some salient progress early.
Asking a colleague for help directly can have its drawbacks, however. Software engineers are notoriously bad at estimating how much time and effort a task will take. This is often a consequence of the relentless optimism that being a software engineer requires. When something seems like a small request, sometimes our colleagues can be a little too quick to say yes, not taking much time to scope the commitment properly. They may only realize well after the project’s kicked off that they’ve said yes to a few too many things and are now struggling to juggle it all. (I’ve been that person and trust me when I say that saying yes to too many things is just as unhelpful as saying no to everything.)
Another problem with asking a colleague for help directly is that you might overlook others who are better suited for the role. We all suffer from a number of biases we must consciously work to counteract. One such bias is recency bias: we tend to recall more quickly the things we’ve seen most recently. We are more likely to list a colleague as a good potential expert if we’ve recently heard their name or spoken to them. We need to be mindful of that bias before we finalize our expert list, and take a minute to question whether each expert truly is the best one for the job or whether we just happened to see their name copied on an email a few days ago. If we think a more qualified candidate might be available, we should do our research and consider contacting a team rather than an individual. Managers of expert teams can vet your request for help with each of their developers and gauge interest. Great managers will identify those on their team who could contribute meaningfully and who would also benefit the most from the visibility and career growth that come with contributing to your refactoring effort.
It’s also important not to confuse expertise with seniority. Frankie might not be the engineer with the most industry experience or have the longest tenure at the company, but they’ve made significant contributions over the past few months and you’re confident they can answer your questions and offer valuable insights in code reviews. Sometimes, the most senior person might not be the best collaborator; oftentimes these developers are very busy leading demanding projects of their own and their time is more valuable elsewhere. Your project might also be a prime opportunity for someone to get valuable exposure and visibility beyond their immediate team. Refactoring (particularly refactoring at scale) can be a tricky endeavor, but engineers with just a year’s (or even a few months’) experience can still contribute to it meaningfully and learn from it.
If you’ve highlighted a team as being a good set of expert candidates, I recommend talking with their manager directly so that they can vet your request to their team, gauge interest, and help identify a number of potential candidates. Asking the manager for their input in choosing one or two experts from their team can help you minimize the biases you bring to the recruitment process.
We’ve spent quite a bit of time in this chapter talking about forming a team. But what about your existing team? Are you the best-suited group to take on the proposed refactoring effort? To set yourself up for success as a technical lead for your team, you have to understand why your team is best positioned within the context of your organization to take the project on. There are generally three kinds of teams that undertake large-scale refactoring projects.
This kind of team owns a particular piece of the product and is refactoring code that it primarily owns or is responsible for. This code interfaces with other teams’ code at some number of boundaries. At those boundaries, they must figure out whether to make the changes themselves or coordinate with the engineers whose code they are interfacing with to make the necessary changes.
Say, for instance, you work at a company with three broad engineering groups: developer productivity, infrastructure, and product engineering. You are on the team responsible for testing libraries and tooling for your application in the developer productivity group. While it’s always great that engineers across the organization are writing more unit tests, you’re worried that the amount of time required to run them all has begun to hinder everyone’s ability to ship code quickly. With performance in mind, you start tracking timings for individual unit tests, gathering metrics on how long certain operations like setting up a complex mock state take. Your team decides to kick off a refactor, focusing their efforts on speeding up the mock setup process. Although benchmarks for the new version show a drastic improvement, existing unit tests will need to be migrated to use the new setup logic to benefit from the speedup. There are two main ways to go about the migration:
The first option is for your team to migrate everyone’s tests for them. This approach has some distinct advantages. Your team is the most familiar with how best to migrate a test from the old to the new mocking logic; you know which kinds of tests lend themselves to easy migrations, the pitfalls to avoid with trickier tests, and how to maximize usage of the new mocking system to reap the most performance improvements. Your team is likely also the most motivated to execute the migration. As owners of the testing framework, you’ve decided that this is a top priority. You’ve likely set some quarterly objectives around decreasing the amount of time required to run the full testing suite. Knowing you’ll be evaluated on whether your team achieved that goal is very motivating (especially when nearing the end of the quarter).
On the flip side, there are thousands of tests to migrate. Your team might develop a clever way to use code modification tools to handle some of the easiest migrations automatically, but that would only get you a small percentage of the way to completion. If you divvied up the remaining callsites evenly across your team, it might still take you weeks of manual, repetitive work to move everything over to the new system. Your team is also not intimately familiar with what each of these tests is actually testing. As much as we’d like to assume that the tests treat the current mocking system as a black box, we can’t always predict how tightly coupled the tests might be to the behavior of the existing implementation. There is a strong chance that we will eventually need some context on what functionality a test is exercising (and how) to adapt it to use the new mocking system properly.
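To make the limits of automation concrete, here is a minimal sketch of such a code modification tool. The helper names `old_mock_setup` and `new_mock_setup` are hypothetical, and treating a call as "simple" when its arguments sit on one line with no nested parentheses is an assumption made for illustration. Anything that does not match is reported for manual migration rather than rewritten.

```python
import re

# Hypothetical names: `old_mock_setup` is the legacy helper and
# `new_mock_setup` is its replacement.
# A "simple" call has single-line arguments and no nested parentheses.
SIMPLE_CALL = re.compile(r"\bold_mock_setup\(([^()\n]*)\)")
ANY_CALL = re.compile(r"\bold_mock_setup\(")


def migrate_source(source):
    """Rewrite simple call sites.

    Returns the new source and the 1-based line numbers of call sites that
    were left untouched and need manual attention.
    """
    migrated = SIMPLE_CALL.sub(r"new_mock_setup(\1)", source)
    manual = [
        number
        for number, line in enumerate(migrated.splitlines(), start=1)
        if ANY_CALL.search(line)
    ]
    return migrated, manual


example = (
    "state = old_mock_setup(user, orders)\n"
    "other = old_mock_setup(make_user(),\n"
    "                       orders)\n"
)
new_source, manual_lines = migrate_source(example)
print(new_source)
print("needs manual migration on lines:", manual_lines)
```

A text-based rewrite like this also touches matches inside comments and strings, which is one reason a production tool would more likely operate on the syntax tree. Even then, the calls it skips are the ones that require knowing what the test is actually verifying.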
The second option is for the teams in the product engineering group to migrate, on their own, the tests related to the features they own. With this approach, your team no longer needs to tackle thousands of tests alone. By distributing the work across the engineering organization, there’s a strong chance that the positive impact of the migration will be experienced much more quickly. Engineers on your team also don’t need to worry about deciphering how some of the trickier tests work on their own. Because each team is tasked with updating its own tests, it can do a much more effective job of retaining each test’s intended behavior. (As an added bonus, teams participating in the effort are given a great opportunity to review their current test coverage critically and maybe even improve it beyond shedding a few seconds at runtime.)
This approach comes with a few drawbacks of its own. While you should produce documentation for how to best upgrade a test, regardless of which option your team chooses, the initial quality of (and timely updates to) the documentation becomes much more important with this approach. Engineers actively migrating their tests will rely heavily on your team to answer questions and be available for code reviews. Even if you have an exceedingly thorough document of frequently asked questions readily available, you’ll probably still have to answer the same handful of questions more than once.
Although you hope to convince enough engineers that the performance improvements of the new system are worth the effort, there will probably be a number of teams that fail to take the bait. A few teams might commit to the migration but ultimately fail to complete it because building new features was of higher priority. When encouraging other teams to participate in a refactor, even when everyone agrees that the benefits are tangible and significant, be mindful that unless these teams have committed equally, setting quarterly objectives for its completion, your project will be one of the first to be pushed aside.
Neither option is perfect, but the one you choose will have an impact on your ability to achieve your team’s short- and long-term goals as well as on your relationship with other engineering teams. If possible, I recommend mixing the two strategies to minimize the downsides of either approach and maximize your chances of completing the refactor successfully. With our test scenario, for example, here are a few steps I would recommend.
Have your team identify a few simple tests that might benefit the most from the migration. Reach out to the product engineering teams to get additional context on which tests they deem to have the most impact.
Start with Option 1 (see “Option 1: One team migrates all tests”). Begin migrating the tests manually and document the process thoroughly. (If the tests are clearly owned by a specific team, either give that team a heads up or work with it to complete the migration.)
For the migrated test files, run benchmarks to demonstrate the performance impact clearly. Document those, too.
Develop a code modification tool to migrate a few simple cases automatically. Run the modification tool on small, logical subsections of the testing suite until all candidate tests have been migrated.
Kick off Option 2 (see “Option 2: Teams update their own tests”). Evangelize the new mocking system by highlighting the benefits and pointing engineers to sample migrations. Spin up office hours to answer questions and troubleshoot with engineers in person. Consider organizing regular jam sessions, when engineers across the organization can join your team to crank out a few migrations.
Work with teams to set quarterly objectives for improving the performance of their tests; if they’ve committed to being evaluated on their participation in the effort, the chances are better that the tests will get done.
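The benchmarking step in the list above can be as simple as timing the same test file before and after migration and recording the difference. The sketch below assumes each run can be wrapped in a Python callable; `run_with_old_setup` and `run_with_new_setup` are stand-ins invented for illustration, not real test runs.

```python
import statistics
import time


def time_callable(fn, repeats=5):
    """Return the median wall-clock seconds over `repeats` runs of fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def summarize(before_seconds, after_seconds):
    """Describe the change between two timings of the same test file."""
    saved = before_seconds - after_seconds
    return {
        "before_s": before_seconds,
        "after_s": after_seconds,
        "saved_s": saved,
        "percent_faster": 100.0 * saved / before_seconds,
    }


# Stand-ins for running one test file with the old and the new mock setup.
def run_with_old_setup():
    sum(range(200_000))


def run_with_new_setup():
    sum(range(20_000))


report = summarize(time_callable(run_with_old_setup),
                   time_callable(run_with_new_setup))
print(report)
```

Using the median of several runs dampens noise from a busy machine. Publishing the resulting numbers alongside the migration guide gives other teams a concrete reason to take part.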
Some larger engineering organizations have teams dedicated to improving developer productivity. The range of work these teams take on can be quite wide: they provision and manage development environments; they write editor extensions and scripts to automate repetitive tasks; they build tooling to help developers better understand the performance implications of their proposed code changes; they maintain and expand upon the core libraries all product engineers depend on (including logging, monitoring, feature flags, etc.). More often than not, the developer productivity teams that continue to work alongside product developers within the boundaries of the application end up taking on the role of the cleanup crew.
Cleanup crews take on the important (but often thankless) work of identifying and shedding cruft and antipatterns from a codebase and establishing better, more sustainable patterns in their stead. These teams are usually made up of engineers who care deeply about code health and want their fellow product engineers to have an easy time developing, testing, and ultimately shipping new features. Seeing other developers at the company use (and appreciate) their libraries and tools is what gives them the greatest satisfaction.
Typically, these teams take on hefty refactors for two reasons. First, the teams’ breadth of knowledge of the codebase is unparalleled. Because these crews are owners of core, functional libraries, they tend to have at least some exposure to almost every corner of an application. This is especially true of teams that work inside monolithic codebases. Second, the teams value developing ergonomic solutions that are accessible to everyone, regardless of team or seniority; they have valuable experience thinking about what kinds of interfaces strike the right balance between extensible and practical. If the main driving motivation behind the project is to boost developer productivity (and keep it there), then this is the perfect team. A third, implicit reason is that by having the cleanup crew map out and execute the refactor, product development teams can continue to focus on feature development relatively undisturbed.
Unfortunately, cleanup crews are not sustainable. When these groups are productive, other engineering teams, typically feature-development teams, feel less of a responsibility to commit to doing important maintenance work. Over time, the cleanup crews accrue an insurmountable amount of work, slowly burning out their team members. As a result, these teams are usually short-lived or have high turnover. Furthermore, the teams shirking maintenance work gradually lose the muscle memory associated with supporting features long term. Throwing another large-scale refactor their way might not be a viable option.
Now that we’ve gained context on the kind of relationship our team has with our refactoring project, determined the expertise we’ll need, and brainstormed a list of corresponding experts we hope to recruit, we come to the hard part: convincing them to help us. While we might not be able to offer one eleventh of the 150 million dollars contained inside the Bellagio safe, we can try to make a convincing argument that contributing to the refactor is well worth their time and effort. Different individuals respond differently to a number of techniques, so we’ll outline a few here.
Do not be afraid to deploy multiple tactics for a single expert (whether that’s a team or an individual). The busiest or the most skeptical experts will likely need more than just a single reason to agree to embark on the journey with you, and rightfully so! As a collaborating expert in any role, you are agreeing to allocate a (maybe significant) portion of your valuable time and energy to the project. If the refactor comes with significant risks (and most do), you are opening yourself up to involvement with incidents. If it’s likely to drag on for a while, you may have to pass up other opportunities as they arise. Getting involved with a sizable refactor does not come without its risks. You should not try to minimize those risks; instead, aim to make the experts see that the benefits decisively outweigh them.
Finally, persistence can be a technique on its own. If you’ve spoken to each of the potential experts for a given expertise on your list and haven’t gotten anyone to bite, loop back around. The first few candidates will have had more time to consider the opportunity and you’ll probably have a few more tricks up your sleeve from the many other conversations you’ll have had to date.
Appealing to an engineer is a different experience from appealing to a team’s manager. Engineers are much closer to the code; they experience the pain points your refactor wants to address much more concretely, acutely, and frequently. In my experience, you very rarely have to spend considerable (if any) time convincing engineers that the problem you perceive is an actual problem; they often know exactly why the pain points you’re seeking to fix matter so much because they’ve experienced that exact pain on multiple occasions themselves. With engineers, you’ll probably be able to use most of the pitching techniques outlined in the upcoming sections successfully (maybe combining a few).
Managers, on the other hand, might only feel the pain from a secondary perspective; for example, they might notice a gradual increase in time estimates suggested by engineers during sprint planning due to a corresponding increase in the complexity of the code. In one-on-ones, some engineers might express frustration with frequent incidents due to brittle, poorly tested code. Managers also often have no incentive to prioritize refactoring over feature development. This is usually because managers are measured on their team’s productivity in shipping net new product innovations at a regular cadence. Spending a quarter or two improving the code the team is responsible for so that they can subsequently speed up their development velocity in future quarters is a difficult sell for upper management, so managers don’t put up a fight unless there is a dire need for code cleanup. Of the techniques proposed next, I recommend leaning heavily on the metrics and bartering pieces.
You can ensure that managers are motivated to prioritize code health and quality on their team(s) by explicitly evaluating them on their ability to define measurable goals around it and to support the team in achieving those goals. It’s not always easy to get upper management to buy into adding this as an important evaluation metric, but if you can, it can make a world of difference in how your engineering organization builds and maintains software.
In Chapter 3 we explored a variety of ways in which we could quantify the current state of the application before embarking on our refactoring journey. Chapter 4 discussed how to develop a thorough plan of action, complete with a solid set of success metrics determined from the initial measurements taken using the methods outlined in Chapter 3. These metrics can help you build a convincing argument for getting help with your refactoring endeavor.
Typically, these kinds of pitches are most effective with the more skeptical experts and those who are most data-driven in their regular work. These are the engineers who are always asking questions; they actively monitor the p95 response times of APIs their team is responsible for maintaining; they’re the first ones to notice an uptick in the average number of database operations hitting a specific shard. Appeal to their analytical side with your own metrics and you might secure yourself a new expert.
First, articulate why the metrics you’ve chosen are good indicators of the problem. Take the time to carefully explain the relationship among the problems you hope to fix, how you choose to quantify them, and the initial statistics you’ve gathered. Choose simple metrics first, and then augment your case with additional supporting data points. If you’ve acquired or generated any visuals that help illustrate the problem, reference them; even those coworkers we think of as numbers people appreciate an explanatory graph or chart every so often.
Juxtapose the starting metrics with your defined success metrics, starting with the desired end state. Afterward, you can walk the expert through the evolution of the metrics throughout the effort, from start to finish. Emphasize that your success metrics decisively show that the refactor would be successful and that they are sufficiently ambitious but achievable.
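One way to present that evolution is to express each checkpoint as the share of the gap between the starting metric and the success metric that has been closed. The sketch below uses made-up numbers (full test-suite runtime in minutes); the function name and the checkpoints are assumptions for illustration only.

```python
def progress_toward_target(baseline, target, current):
    """Fraction of the baseline-to-target gap closed so far.

    Works whether the metric should go down (runtime) or up (coverage).
    """
    if baseline == target:
        raise ValueError("baseline and target must differ")
    return (baseline - current) / (baseline - target)


# Hypothetical numbers: full test-suite runtime in minutes.
baseline, target = 40.0, 25.0
checkpoints = {"start": 40.0, "midpoint": 34.0, "end": 25.0}

for label, value in checkpoints.items():
    share = progress_toward_target(baseline, target, value)
    print(f"{label}: {value:.0f} min, {share:.0%} of the way to the goal")
```

A value above 100% means the effort overshot its goal, and a negative value means the metric regressed past its starting point. Both are worth calling out when you walk an expert through the numbers.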
There’s an odd cognitive dissonance known as the Benjamin Franklin effect: you have a better chance at getting someone to like you if you ask them for a favor than by doing a favor for them. To give an example, say Charlie asks a favor of Dakota. Dakota happily obliges. According to the effect, Dakota is now more likely to do another favor for Charlie than if Charlie had done one for them. The idea is that people conclude they help others because they like them, even if they actually don’t, because their minds struggle to maintain logical consistency between their actions and perceptions.
Engineers working closely with the code you aim to improve are more likely to understand its pain points. They probably know at least a handful of other engineers (either on their immediate team or in the organization at large) who experience these same pain points regularly. If this expert is the kind of coworker who has a finger on the pulse when it comes to the health of the codebase and the engineering morale surrounding it, there is a strong chance that they have a great deal of empathy for their teammates and you can successfully appeal to their inner altruist.
Ask the expert about the things they’ve heard their teammates complain about. Make a mental (or written) note of the specific pain points that the refactor intends to fix. Once you’ve commiserated on the difficulties of the code in its current state, list each of the problems they mentioned and walk through your proposed solution. There might be a few problems that you don’t yet have an explicit solution for, and that’s perfectly all right! In fact, this is precisely why you reached out to this expert; you seek their perspective on the problems you’re trying to solve. Make it clear to them that these are the kinds of insights that they could provide to the project. Finally, emphasize that their contributions would concretely make their coworkers’ lives (at least a little bit) more pleasant and more productive. Point to the expected benefits of the refactor and summarize the success metrics (because a multifaceted pitch is ultimately a stronger pitch).
If the expert you’re pitching is looking for a good career advancement opportunity or a chance to be more visible to other parts of the engineering organization, a large-scale refactoring project can be the perfect line item on their resume. Earlier in the chapter, we mentioned that some managers might want to identify team members who might both be an asset to the project and gain valuable visibility within the broader engineering organization; if they’ve provided you with a few names, make sure to have a conversation with them about what kind of growth and visibility these individuals need to get to the next level.
当您与专家坐下来时,请讨论他们正在寻找哪些类型的成长机会。希望工程师和他们的经理对他们需要表现出的行为或需要推动的项目达成一致,以在职业生涯中成长,但情况并非总是如此。如果您想果断说服工程师加入您的团队,同时为他们取得成功做好准备,那么花时间将经理的期望与工程师的期望结合起来是最好的方法。从综合意见中,花时间确定这位专家可以做出贡献的重构的几个关键部分,以展示他们正在寻找的关键特征。当您与他们会面时,带他们了解每个里程碑并强调他们可以做出的贡献。描述您希望这些贡献中的每一个如何帮助他们实现目标。注意保持开放的对话,并对他们的意见持开放态度。你不是站在他们的立场上,也不是他们的经理,所以他们对如何最好地取得成功的看法可能与你不同。
When you sit down with the expert, have a conversation about what types of growth opportunities they’re looking for. Hopefully the engineer and their manager are aligned on what behaviors they need to exemplify or projects they need to drive to grow in their career, but that is not always the case. If you want to convince the engineer decisively to join you, all the while setting them up for success, taking the time to reconcile the manager’s expectations with those of the engineer is the best approach. From the combined input, identify a few key portions of the refactor that this expert could contribute to in a way that demonstrates the key characteristics they’re looking for. When you meet with them, walk them through each of the milestones and highlight the contributions they can make. Describe how you hope each of these contributions can help them achieve their goals. Be careful to keep an open dialogue, and be open to their input. You’re not in their shoes, nor are you their manager, so their perspective on how they can best be set up for success might differ from your own.
If all else fails, be ready to barter. Bartering can be a great way to acquire the resources you need to finish your project successfully, with some sort of commitment in return. Typically, bartering doesn’t happen between you and another engineer but rather between your own manager and the manager of the team you’re seeking help from. The promise you make in return can vary; it’s all about finding what the other manager values most and finding an adequate alternative you’re happy to provide in exchange. Here are just a few examples:
Say your team has an open headcount and the team you want to recruit experts from is in desperate need of additional headcount. If your organization allows it, and you are comfortable giving up some of your available headcount, you could provide the team with the headcount it needs in exchange for one or two engineers to contribute actively to the refactoring effort.
On the off-chance that your teams have compatible feature ownership, you could barter taking additional ownership of some components the other team has been wanting to shed. Oftentimes when teams have unclear or debated boundaries, areas tend to become entirely unowned or tossed between the two teams frequently (which essentially leads them to be unowned). In exchange for help, your team could agree to own those features or components decisively for a set period of time (a few quarters or a year).
If your engineering organization has communal responsibilities (completing a certain number of hours of customer support or participating in interviews), you can offer for your team to take on some (or all) of the expert team’s responsibilities for a defined period after the refactoring effort has wrapped up. (Ideally, you agree for the exchange to kick off only after the project has finished or when it is near completion, because any time taken away from it will only make it drag on, to the detriment of everyone involved.)
When bartering takes place between two engineers, normally it’s an exchange of subject matter expertise; that is, the expert you’re recruiting as an SME wants you to contribute as an SME on an ongoing or future project. I’ve also seen engineers agree to trade code review, take on additional on-call shifts if they share a rotation, or agree to document and facilitate a certain number of postmortems on the expert’s behalf.
Be aware that with bartering, either party can fall through on their promise if priorities shift within the duration of the refactoring effort. Reorganizations at companies of any size can render these agreements void due to shifts in management or feature ownership. Managers or engineers leaving the company or switching teams can also have an impact on any prearranged agreements. The longer the refactor goes on, the greater the chance the agreement might fall through for whatever reason.
If you cannot convince your first choice for each type of expertise, don’t worry! This is why brainstorming multiple names early on is important. Ideally, you can secure an expert for each kind of expertise. If you have trouble coming up with more candidates, consider asking those who’ve turned down the opportunity for additional recommendations; they might be able to give you a name or two.
If you cannot secure an expert for a skill that you won’t need initially, consider pausing the search and picking it back up once you reach the stage when it becomes necessary. Experts who were previously on the fence might be convinced to join if they see sufficient progress and maybe the hint of a positive shift in the initial metrics. Refactoring can be a little bit like a snowball rolling down a snowy hill; as it gains momentum, it affects a greater and greater surface area, gathering up more and more resources as it nears completion.
If all the stars align, we might manage to convince everyone we’ve pitched and assemble the absolute best team for the job. Congratulations! Unfortunately, the ideal outcome is quite unlikely. There’s a strong chance you won’t be able to assemble your perfect dream team, and that’s all right. We can figure out a way to work effectively with the resources we can secure and deliver a quality refactor! Before we close out the chapter, we’ll spend some time exploring what a realistic scenario might look like and how to make the most of it. We’ll also briefly discuss how to handle the worst-case scenario: having to go it alone.
The most realistic scenario is one in which you end up with a small handful of committed experts and teammates. At smaller companies experiencing a great deal of growth, everyone wears more than one hat and every engineer has a full plate, so it’s unlikely you’ll be able to get an expert to fill each of your desired kinds of expertise. At larger, more stable companies, you might have a difficult time getting folks from other teams to commit to helping you out simply due to organizational boundaries and priorities; just because someone is an expert in something you’ll need context on to complete your refactor successfully doesn’t mean that it is that expert’s or that expert’s management chain’s top priority.
Regardless of who you were able to convince before kicking off development, you’re in a good spot if you’ve managed to gather a core team of at least a few engineers for the earliest portions of the project. After all, the team you start with might not be the team you end with, because the support and expertise you need to complete the first few milestones aren’t necessarily the support you’ll need for the remainder of the project. You might very well be able to encourage others to join you once you’ve shown some tangible progress and the benefits of the refactor become more visible to fellow engineers.
The absolute worst-case scenario is if you aren’t able to secure any additional help and need to execute the project alone. Now before we start exploring how to make the best of this situation, I want to take a moment to acknowledge that if your only option is to execute a large, cross-functional refactor alone, you may want to consider not doing it at all. If the engineering organization is not sufficiently convinced by your proposal to allocate staffing properly, and the expert engineers you’ve reached out to are unconvinced as well, maybe it’s time to go back to the drawing board and strengthen your case. Otherwise, maybe it’s time to consider that perhaps now is not the right time to execute on this project.
In the event that your manager, teammates, and a number of other engineers believe in the importance of the effort, but there simply aren’t enough resources to go around, you may consider moving forward alone. Be forewarned, however, that it is not an easy path. Working alone can be terribly isolating. Because it’s just you, slowly making progress one step at a time, it can feel like you aren’t making significant progress. You rarely have the chance to bounce ideas off other people who have substantial context on the state of the project and don’t need to be brought up to speed every time you need a second opinion.
On the plus side, you don’t have to coordinate with anyone else; you hopefully know the sequence of steps you need to take, and you can execute them serially. Not needing to coordinate with anyone else can also be a serious downside. You have to keep very, very good track of everything you are doing and make that information available publicly so that others who are invested in your effort but unable to contribute can gauge where you are on the project.
One or two incidents are nearly inevitable when making expansive changes to a codebase. While postmortems should be blameless, when there is only a single individual responsible for a given project, it can feel as though the burden of responsibility and subsequent remediation falls solely on you instead of on a group of folks.
I highly recommend taking a look at Etsy’s postmortem process developed by John Allspaw if you haven’t already. Their approach to incident response is quite thorough and promotes deliberate, focused growth within an engineering organization, all the while preserving individual engineers’ psychological safety.
I recommend that you find a buddy, maybe someone else who has also been tasked to be the sole owner of a significant project. This person is there to keep you accountable and motivated, similar to how you might regularly meet up with a friend for yoga: you know that they’ll be there because you’ll be there, and vice versa. You can establish a regular cadence for meeting up and talking through the progress you’ve made to date on your respective projects. You can help each other brainstorm solutions to the tough problems, and, on occasion, review each other’s code. Either way, having someone there to keep you company on the tough road ahead is absolutely critical to staying on track.
You’ll need to hone one important skill throughout the entire team-formation process to build an effective team: communication. The best communicators can assemble the best teams by convincing the right engineers to join and setting clear expectations of their involvement from day one. Each contributor, whether they are an active teammate or a subject matter expert, is well aware of their role and responsibilities within the larger effort and feels confident in their ability to deliver on the stated expectations.
Communication continues to be of utmost importance throughout the remainder of your refactoring effort, especially as you begin to make changes to your codebase. In the next chapter, we’ll discuss the importance of frequent, thorough updates and explore techniques for establishing and maintaining a free flow of information between your team and those affected by your changes.
A friend of mine, we’ll call her Elise, recently embarked on a house-building journey. Over a period of several months, Elise became intimately involved with every step of the process. She coordinated with plumbers, electricians, carpenters, tile-layers, and countless crews cycling in and out of her build site. Each of these professionals worked in tight-knit teams, bringing her home to life, piece by piece.
Every so often, some of Elise’s friends, like me, would ask her how the house was coming along. She’d launch into an epic tale about the bathroom tiles, pulling out pictures of samples she considered, detailing the many phone calls required to replace the original batch when most of them arrived cracked. Then she’d realize she hadn’t told me about the plans for the second bathroom and pivot to a new set of anecdotes.
I did love hearing about how her house was coming along, but Elise’s nonlinear storytelling, coupled with the grueling detail, was a bit too much for me (and many of her other friends). So, after a few conversations, we asked her to start a blog. There, she could document the progress, complete with pictures and arduous detail, and we could periodically check in and casually browse at our leisure. We’d found a way of keeping up with the construction in a medium that worked for everyone.
Elise has a direct, detail-oriented approach in her everyday communication with the construction crews, and more of a big-picture approach in her blog. With a large refactoring project, you have to manage communication hurdles from two distinct perspectives as well: first, within your own team (Elise with her construction crew), and second, with external stakeholders (Elise with her friends). In this chapter, we’ll discuss communication techniques you can use to keep both groups informed and aligned. We’ll look at important habits you should establish for your team, and some tactics for fostering a productive team. Then we’ll look at what measures you should be taking to keep individuals outside of your team in the loop. We’ll also discuss some strategies for coping with stakeholders that are either too hands-on or not hands-on enough.
The ideas in this chapter are meant to give you a blueprint for developing strong communication habits on your team. Your company might already have well-established practices around the way large, cross-functional software projects are coordinated, tracked, and reported on. Your manager, product manager, or technical program manager may also have their own ideas of how best to set up your team for success. I recommend listening to these individuals, reading the ideas that follow, and piecing together something that you believe will work best for everyone. Hopefully by the end of this chapter, you’ll have a new set of tools ready to use for your next large refactoring project.
Communication within your team is hopefully already low-friction and frequent. If so, your team is probably participating in a number of exchanges during a regular workday. You’re pair-programming, reviewing each other’s code, and debugging together. Your team might also have daily stand-up and weekly sync meetings. Many of us don’t think about how we’re communicating when we’re partaking in these interactions. They simply feel like a routine part of our job, as they should. However, some of these interactions could be made a little bit more deliberate to better support longer-term, technically complex projects (like a large-scale refactor).
To keep your team moving forward, free from misunderstandings and other mishaps, there are a few communication habits you should consider implementing from the very start. Some of these concepts will be familiar to those who practice Agile, even minimally. We’ll look at both high-frequency habits (i.e., on a daily or weekly basis) and low-frequency habits (i.e., on a monthly or quarterly basis) that are important for taking a critical look back at what you’ve accomplished to date and what still lies ahead.
If possible, I recommend instituting a policy of no laptops and minimal phone usage during meetings. Ideally, the only people who should be using a laptop during a meeting are those actively participating by either taking notes or sharing content on their screen. If a meeting attendee is on call or actively contributing to incident remediation, having a laptop out is more than fine. This policy might sound a bit rigid, but I truly believe that it can benefit everyone. I find that I maintain much better focus during meetings I attend without my computer; I listen more attentively, offer better ideas, and more often leave the meeting feeling that it was productive. If you’re curious to give it a try, start out by instituting the policy for just one or two meetings. You might find that they’re more productive and, on occasion, end earlier!
Your team is probably communicating pretty frequently in a number of unique ways. During your typical workday, you’re probably chatting, pair-programming, reviewing code, and debugging together. There are some more regular, structured means of communication that can be meaningful to make sure that everyone is checking in at a good cadence. We’ll outline a few here and describe how they can be valuable.
Stand-ups are a great habit for keeping everyone on the team aligned at regular intervals. They can be a good forcing function for you and your teammates to update the status of your tasks within your project planning tool. Stand-ups are also a great opportunity to reflect back on the past 24 hours; have you made sufficient progress or should you reach out to a teammate for a helping hand? Given what you learned yesterday, what do you plan to do today?
Every team has a different approach to stand-ups. Some folks prefer an in-person meeting where everyone gathers around their desks and recites their progress from the day prior. Requiring everyone to be present for stand-ups at a regular time every day has its advantages: it gives engineers a daily anchor point around which they can plan their work.
When working on large software projects, having a designated time when you can take stock of the progress you’ve made, however small, is crucial. Sometimes the scope of the effort can seem overwhelming, and being able to focus on incremental steps forward can make it feel more achievable. Daily, in-person stand-ups also provide a forum for everyone to get important face time with one another. As monotonous as stand-ups might seem, if the majority of your team spends a significant portion of its time programming independently, daily stand-ups might be one of the few face-to-face interactions it has.
When I use the words “in-person,” I’m referring to any face-to-face medium. That can be physically in person in the same office or scattered throughout the world and meeting over video conference. The important piece is that everyone is taking the time to see and listen to one another away from distractions.
Other teams prefer an asynchronous way of catching up, relying instead on their main collaboration platform (whether that’s Slack, Discord, or a similar tool) to post a summary of their previous workday. One downside of in-person stand-ups is that they require everyone to be available at precisely the same time every day. For highly distributed teams, in-person stand-ups are either very inconvenient or nearly impossible amid a wide array of time zones. They can also awkwardly break up engineers’ mornings or afternoons and diminish the amount of time they have to focus deeply on a task at hand.
Effectively refactoring code usually takes acute concentration; you are trying to decipher what the current implementation is doing (by reading through it or running the corresponding unit tests) and then, from that understanding, determine the best way to improve it, and then, finally, craft the improved implementation, replicating the precise behavior of the initial solution. Most programmers need several consecutive hours of uninterrupted time to get into the headspace required to make measurable progress toward a difficult task. If an engineer gets into work at nine o’clock only to be interrupted by a stand-up at 10:30, they might not even bother to start on a task, knowing that they won’t make much progress.
To best simulate a stand-up meeting, an asynchronous stand-up usually requires participants to provide an update by a certain time. Say, for instance, your team provides updates asynchronously by 10:30 a.m. every weekday. If you’re an early bird and typically get into the office at 8 a.m., you might provide your update immediately and dive headfirst into your next task. Your fellow teammates submit their updates as they begin working. By 10:30 a.m., if anyone on the team hasn’t written anything yet, they might get a gentle nudge from your manager until everyone’s given an update.
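If your team wants to automate the gentle nudge, the check itself is small. The following is only a minimal sketch: `StandupUpdate` and `members_to_nudge` are hypothetical names, the 10:30 a.m. weekday cutoff is taken from the example above, and the code assumes the updates have already been collected from whatever collaboration tool you use (it does not call any real Slack or Discord API).

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class StandupUpdate:
    """One asynchronous stand-up post (hypothetical record shape)."""
    author: str
    posted_at: datetime
    text: str


def members_to_nudge(team, updates, now, deadline=time(10, 30)):
    """Return teammates who have not posted today once the deadline has passed.

    Assumes a Monday-to-Friday schedule and that all timestamps share one
    time zone; a distributed team would need per-person deadlines.
    """
    if now.weekday() >= 5:
        return []  # Saturday or Sunday: no stand-up expected
    if now.time() < deadline:
        return []  # still before the cutoff, so nobody is late yet
    posted_today = {u.author for u in updates if u.posted_at.date() == now.date()}
    return [member for member in team if member not in posted_today]


if __name__ == "__main__":
    team = ["ana", "bo", "cy"]
    updates = [StandupUpdate("ana", datetime(2021, 3, 1, 8, 5), "Migrated two call sites.")]
    print(members_to_nudge(team, updates, now=datetime(2021, 3, 1, 10, 45)))
```

Keeping the check free of any chat-platform code means the same function could back a bot, a scheduled script, or a manager's manual checklist.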
You can continue to hold daily stand-ups when working on a large-scale refactor, but you may want to revisit their frequency throughout its execution. For example, if your team has entered a milestone with highly parallelized workstreams, updating one another on these distinct, loosely related streams on a daily basis might not be a good use of time. If your updates are highly technical and detail-oriented, most of your teammates won’t have the granular context needed to appreciate them. Instead of a daily stand-up, you could consider providing more comprehensive updates twice a week or during a weekly sync.
Daily stand-ups are meant to convey a quick snapshot of everyone’s progress; they are not a one-size-fits-all means of regular communication with your team. Think about the half-hour stand-up. If your team members are spending a considerable amount of time discussing at great depth their tasks and the problems they’re solving during the stand-up, you should consider two options. The first is to ask them to continue their discussion after the stand-up; if the conversation does not involve a significant portion of your team, that should work just fine. Your second option is to begin hosting a weekly sync. This forum should give your team more dedicated time to dig into the topics most top of mind for them.
With large refactoring efforts, because the affected surface area can be quite substantial, a range of engineers from across the organization will typically be involved. When the team is highly cross-functional, with not all members devoting 100 percent of their time to the refactor, a weekly sync is usually a better option than a daily stand-up. With a weekly half-hour or one-hour meeting, the team members can focus on discussing only the updates that are pertinent to the refactor.
I would recommend budgeting about an hour for a weekly sync. You can structure a weekly sync as you would a stand-up, with a few tweaks. For the first half of the meeting, have everyone take turns sharing what they’ve accomplished in the refactor over the previous week. If you expected to make more progress than you did, you should hypothesize about why you fell short: Did you run into roadblocks? Did other, nonrefactor work take center stage? It’s just as important for the team to know what’s holding up the project as it is to know what everyone’s working on. This way, if work needs to be redistributed to keep the project moving forward, the team can spot it right away and pivot accordingly. As you go around the room, make note of any topics that folks might want to discuss at greater length.
For the second half of the meeting, take the time to discuss any important topics. You can gather these topics throughout the week and come to the weekly sync with a complete agenda. Maybe, for example, a teammate discovered a new edge case during testing. Although this was probably discussed in a stand-up at some point, you may want to discuss the edge case further during the weekly sync and give the team a chance to amend the rollout approach to make sure similar edge cases are properly handled.
You can also gather discussion topics during everyone’s updates, keeping an ear open for any intriguing subjects. For instance, a teammate might have mentioned spending time prototyping a way to automate the more repetitive portions of the refactor. Others on the team might benefit from learning more about this prototype and how they can leverage it themselves. As always, practice good meeting etiquette and make sure that everyone has an opportunity to share their thoughts.
Strong teams are built through strong connections, and strong connections are built through meaningful in-person interactions. Weekly syncs are the perfect forum for solidifying your relationship with your teammates. Why is building a strong team so important? Having a team that supports one another can be particularly helpful when the going gets tough. For example, if someone on the team ships a change that causes a serious regression, knowing that the team has their back and one or two teammates will be happy to jump in to help resolve the issue can substantially reduce their anxiety and, in the long run, prevent burnout. Being able to show up for one another is also really important when the work starts to drag.
Most large-scale refactors have sizable milestones consisting of tedious, repetitive work. (All of the examples in this book to date have one or two lengthy, monotonous steps.) These milestones tend not to be exceptionally challenging or engaging; they’re dull but necessary. When the team needs to execute these stages, usually the project starts to feel as though it has slowed to a crawl. Teammates can be more prone to burnout during these stages, but having a group of individuals you can lean on, with whom you can share your frustrations, can make a world of difference. If someone on the team is having a difficult time finding the energy to continue, maybe someone else with a bit more capacity can step in and lend a helping hand.
Be certain to take notes during your weekly sync so that you have a record of everything that was discussed (and any conclusions drawn by the team). These notes, combined with the tasks you’ve been tracking in your project management software, will be helpful when you need a quick reference of everything that’s been achieved by the team for your next retrospective.
Weekly syncs can be combined with stand-ups, or replace stand-ups entirely. In my experience, I’ve found that even with daily or twice-a-week stand-ups, having a weekly team sync is incredibly beneficial because it gives everyone an open forum to discuss the week’s most important topics in greater depth. I would especially recommend holding a weekly sync if your team opts for asynchronous stand-ups; this way, everyone has an opportunity to interact in person on a regular basis. Try out different variations of stand-ups (asynchronous or in-person, daily or every other day), combined with a weekly sync, and see what works best for your team.
Retrospectives are just as beneficial to teams executing on a large-scale refactor as they are to Agile product development teams. They give your team an important opportunity to reflect on the latest iteration cycle, highlight opportunities for improvement, and identify any actions you can take moving forward. Setting time aside to discuss what went well, what could have gone better, and what you plan to change is an essential part of growing a team as a unit and as individuals.
The vast majority of Agile development teams participate in regular retrospectives at different cadences. Some product-focused teams will hold a retrospective (a retro) after the launch of a new feature or after a set number of development cycles. Teams working on longer-term projects might hold a retro once a month or once a quarter. Large, at-scale refactors typically benefit most from retrospectives at the end of major milestones. These are usually lengthy enough to have substantial content to consider, but not so large that the team has trouble remembering everything that’s unfolded since the last retrospective. On occasion, smaller subtasks within a single milestone may feel notable enough to justify a retro of their own; there is no perfect, one-size-fits-all answer for all teams and all refactors. If you’re inclined to think a retrospective is worthwhile, simply ask your team whether it agrees. If it does, schedule one; if it doesn’t, simply wait until the team’s completed the next substantial set of work.
If you aren’t confident in your ability to run a good retrospective, there are plenty of publicly available resources to help you. Atlassian has quite a few articles and blog posts on its website, outlining best practices and exploring original ideas for spicing up your retros.
As with any large-scale software project, a fair share of individuals outside your team will have an interest in your progress. This could include upper management, engineers on affected teams, or senior technical leaders. Upper management will want to check in on the project to ensure that the refactor is progressing at the expected pace and producing the expected results. From its perspective, large refactoring efforts can easily turn into money pits: valuable, expensive engineering time is spent rewriting functionality that already exists, and if the project strays, the time and financial investment only increases. There’s also the matter of opportunity cost, as we discussed in Chapter 5, when managers have to weigh the refactor against further feature development. Upper management will want to be reassured regularly that its decision to invest in the refactor was a good one, and if at any point it determines otherwise, it will likely hatch plans for either pausing or stopping the refactor altogether. You can make sure that it continues to support your effort by honing your communication skills when providing updates outside of your team.
Managers and engineers on teams affected by the refactor will want to keep track of each stage of the project to gauge when they risk bearing its effects. They’ll want to know precisely when the team expects to roll out relevant changes and how long it anticipates it will take. Meanwhile, senior technical leaders will be on the lookout for any setbacks as an opportunity to help steer the project back in the right direction. Typically, these individuals play an important role in shaping the company’s technical vision and are responsible for ensuring that complex, important technical endeavors succeed, including any large-scale refactors.
In this section, we’ll discuss how you can ensure that all your external stakeholders stay up to date with the latest progress on your refactor. We’ll first look at some work you can do upfront to set good habits early, and then we’ll look at how you can keep up with external communication throughout the project’s execution.
There are some important preliminary decisions you’ll want to make about how you plan to communicate with your external stakeholders when you kick off your refactor. By making these decisions early, you’ll help your team save valuable time when coordinating with colleagues outside of your team and decrease the overall likelihood of any miscommunication with external parties.
Even the smallest companies use a number of tools for the same set of tasks. Your company might use both GSuite and Office 365, with some departments preferring one product over another. Even within your own engineering organization, you may have documents speckled across GSuite, GitHub, and an internal wiki. As someone searching for information about a product feature or in-flight project, having to search half a dozen platforms for disjointed pieces of information is aggravating. It can be even more frustrating when pertinent information is in more than one location, and the information doesn’t agree.
When you kick off your refactor, choose a platform your team enjoys using to collect all documentation related to the project. Because you’ll be regularly creating new documents and updating existing ones, you’ll want to choose the solution that has all your favorite bells and whistles. If you’re annoyed every time you need to add something new, you’ll be less likely to do it, and the documentation will fall out of date.
Within your chosen platform, create a directory to house all pertinent documentation; this will serve as your single source of truth. Documentation can include technical design specifications, the execution plan you developed in Chapter 4, meeting notes, postmortems, and so on. Wherever other engineers look for documentation, either link to your directory or, better yet, link to a specific document within it. If your colleagues have muscle memory from searching GitHub for technical documentation but you prefer writing in Notion, create an entry for your documentation in GitHub and link it directly to your Notion entry. This way, not only will your documentation be easy to find, you’ll be certain that there aren’t any outdated copies floating around.
As your team generates documentation throughout the execution of the project, make sure that it all lands in your project directory (with updated external links from other widely used document sources).
Next, you’ll want to set expectations with external stakeholders. Many of these stakeholders will regularly check in with you, polling for new information. Unfortunately, this model can become quite bothersome the more stakeholders you have. If you or your manager receives an email or message every time someone in upper management has a question about how the refactor is progressing, before long, you’ll end up spending quite a bit of time answering these requests. On the other hand, there may be some stakeholders that you wish would check in periodically but unfortunately do not. When this is the case, your team must push information out. Consistently needing to propagate information proactively out to numerous stakeholders can get irritating, particularly if the receiving party doesn’t acknowledge that they’ve read the information you provided.
Instead of either answering each request or contacting each stakeholder individually, spend some time determining how you intend to communicate progress and setting up expectations with your stakeholders early about where and with what frequency they should expect these communications. When stakeholders break from the patterns you’ve established (e.g., you get a ping from your skip-level), instead of providing the information directly, simply reply with a gentle reminder of where they can find what they need.
When you kick off the refactor, take some time to draft a rough communication plan. This plan should include information about the following:
There are a number of places where you can make information about the current stage of the refactor easily accessible to external parties. If your team uses Slack, you can create a channel to house conversations pertinent to the refactor and set the channel’s topic to a short description of the current stage of the project. At the end of the week, post a weekly round-up message detailing the progress made over the past few days. (If you hold weekly sync meetings, you can draft this message immediately afterward and link to your meeting notes.) If your team uses JIRA, provide a link to the project board. For stakeholders who need regular, high-level updates, consider adding a summary field that the team updates weekly on the top-level project.
You can include a high-level project timeline at the root of your project documentation directory, directly within the communication plan itself, or as a subsection to your execution plan. Make sure to keep this timeline updated if any dates end up shifting as the project progresses.
In the communication plan, you can link to the directory where your team intends to draft documentation related to the refactor. Provide a short summary of the kinds of documents the team plans to aggregate there.
There will be instances when individuals across the company will either not be able to find the information they need in the provided resources, or prefer asking the question directly rather than locating the information for themselves. When that happens, you’ll want to make sure that they know where to go. If your team uses Discord, either direct them to the project channel or set up a channel exclusively for questions. If your team relies on email and it has an email group, have the members send an email to the team as a whole rather than to an individual. If your team is cross-functional, set up an email group for everyone involved and direct questions to that group.
When coordinating with teams that risk being affected by the refactor, you want to maintain a high level of transparency. You want to make sure that no one on these teams is surprised or set back by the work your group is doing. To make sure that everyone is on the same page, provide a list of guidelines you intend to follow when interacting with other teams’ code. This could include tagging one or more individuals from that team for code review when modifying code for which their team is responsible or attending their stand-up to provide updates on the refactor when pertinent to their team.
There are a few communication habits your team should consider adopting during the project’s execution. These strategies can help everyone at the company stay informed of your progress while minimizing the amount of proactive outward communication your team needs to do. We’ll also discuss how best to engage with engineers external to the team when seeking out their expertise about the project.
Progress announcements are not only important to let everyone know that you’ve completed another milestone (and unlocked any number of benefits as a result), they are also crucial in continuing to make your team feel productive and boost their morale. Large-scale refactors can feel daunting for teams, even for teams accustomed to working on lengthy projects. Celebrating each milestone as it wraps up helps everyone feel a sense of achievement throughout the duration of the project.
However your company announces the launch of new features, whether that’s a department-wide email or a message in a Slack channel, inquire about providing important progress updates for the refactor. Your team will get important recognition for their hard work and demonstrate to a wide audience that refactoring is a valued engineering investment.
In Chapter 4, we learned how to draft an effective execution plan for our large-scale refactor. We can go beyond using this plan as a simple road map and use it as a place to document our work throughout the project’s progression. Make a copy of the original execution plan. The original version should remain relatively untouched beyond light updates to estimates and milestone metrics as necessary. The copy will serve as a living version of the original document and should be progressively updated as the project develops. (Enabling version history will give you the ability to easily go back in time and compare your initial values with your latest updates.) These updates could include anything from strange bugs encountered to unexpected edge cases uncovered or deviations from the plan. This second version of your original execution plan should give any stakeholders a much more nuanced view into your progress and help your team keep better track of the work it has achieved to date.
For instance, in our example from Chapter 4, the software team at Smart DNA was tasked with migrating all Python 2.6 environments to Python 2.7. We’ve copied over the first milestone of the team’s execution plan as follows:
Create a single requirements.txt file.
Metric: Number of distinct lists of dependencies; Start: 3; Goal: 1
Estimate: 2–3 weeks
Subtasks:
Enumerate all packages used across each of the repositories.
Audit all packages and narrow list to only required packages with corresponding versions.
Identify which version each package should be upgraded to in Python 2.7.
As the software team begins making progress on the migration, it might start filling in the copy of its original plan with more context on its findings. We can see some of those additional details in the plan that follows:
Create a single requirements.txt file.
Metric: Number of distinct lists of dependencies; Start: 3; Goal: 1
Estimate: 2–3 weeks
Subtasks:
Enumerate all packages used across each of the repositories. When we started combing through all of the packages used by the first of the three repositories, we were surprised that the code relied on six additional dependencies that weren’t explicitly listed in the respective requirements.txt file. The researchers were able to provide an updated list for the first repository as well as the 10 other dependencies missing from the requirements.txt files for the other two repos.
Audit all packages and narrow list to only required packages with corresponding versions. Thankfully, 80 percent of the packages used by the three repos were the same. Of that set, only eight packages had different versions that needed to be reconciled.
Identify which version each package should be upgraded to in Python 2.7. This was tricky for seven of the packages in the final, combined set. For these packages, their 2.7-compatible versions deprecated a number of APIs and features that the researchers actively used in two of the three repos. We worked with the research team to gradually migrate away from using these deprecated features before continuing with the refactor.
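The enumeration and audit subtasks above lend themselves to a small script. What follows is only a minimal sketch, assuming each repository pins its dependencies as name==version lines; the function names, repository names, and package pins are hypothetical and are not taken from the Smart DNA example. It reports which packages appear in every repository and which are pinned to more than one version.

```python
from collections import defaultdict


def parse_requirements(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        # Entries without '==' are recorded as unpinned (version None).
        pins[name.strip().lower()] = version.strip() or None
    return pins


def audit(requirements_by_repo):
    """Return (shared, conflicts) across all repositories.

    shared: sorted names of packages present in every repository.
    conflicts: {package: {version: [repos]}} for packages with more than one version.
    """
    versions = defaultdict(lambda: defaultdict(list))
    for repo, pins in requirements_by_repo.items():
        for name, version in pins.items():
            versions[name][version].append(repo)

    repo_count = len(requirements_by_repo)
    shared = sorted(
        name for name, by_version in versions.items()
        if sum(len(repos) for repos in by_version.values()) == repo_count
    )
    conflicts = {
        name: {version: sorted(repos) for version, repos in by_version.items()}
        for name, by_version in versions.items()
        if len(by_version) > 1
    }
    return shared, conflicts


# Hypothetical requirements.txt contents for three repositories.
REPOS = {
    "repo_a": parse_requirements("numpy==1.6.2\nrequests==1.2.3\n"),
    "repo_b": parse_requirements(
        "numpy==1.7.0\nrequests==1.2.3\n# plotting\nmatplotlib==1.2.0\n"
    ),
    "repo_c": parse_requirements("numpy==1.6.2\nrequests==1.2.3\n"),
}

shared, conflicts = audit(REPOS)
print("shared:", shared)
print("conflicts:", conflicts)
```

A script like this only sees what is listed in each file. Dependencies that the code imports without declaring them, like the six the team discovered by hand, would still need a separate check.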
Updating the execution plan as you go means that others can reference it throughout the project’s lifetime to get more context on the specific work the team is doing during each of its many stages. Any SMEs joining the project at a later milestone (or any new teammates being onboarded partway through the refactor) can ramp up on everything the team’s worked on to date just by reading through the execution plan. If you’re anything like me, sometimes you forget why you made a certain decision several months ago; by keeping a verbose account of everything you’ve encountered and the conclusions you’ve reached along the way, you can easily go back and remind yourself of precisely what happened and why.
A detailed account of the team’s experience can also be helpful for engineers and managers referencing the refactor well after it’s completed. Engineers seeking to understand how the codebase has evolved over time may want to read your detailed plan. For the engineers involved with your refactor seeking a promotion, having concrete documentation pointing to the highly technical problems they solved at each step can be incredibly valuable. Elsewhere at the company, engineers looking to kick off their own large-scale refactor might look to your documentation for an example of how to execute a substantial refactor successfully.
All of us seek advice from peers and experienced colleagues when solving difficult problems. While we might be eager to request feedback from senior engineering leaders (and benefit greatly from it), getting and retaining their attention can be notably difficult. Whether they’ve been engaged with the refactor from day one as SMEs (see Chapter 6), or are just getting up to speed, they’ll likely be slower to respond to your inquiries simply because they are unusually busy with many responsibilities across a multitude of projects. Ideally, if you are able to communicate your expectations appropriately, none of these individuals should become bottlenecks.
The term “senior engineer” here refers to the most experienced individual contributors within a team, department, or company at large, not to be confused with the title, Senior Engineer, held by many professionals in the industry. These are usually the folks with much bigger titles like Senior Staff, Principal, or Distinguished Engineer. Sometimes, these are simply the folks who have been at the company the longest.
When soliciting feedback from these senior engineering leaders, we must first decide the scope of the feedback we’re looking for. This is helpful for two main reasons. First, explicitly defining which aspects of the problem or solution we want our colleague to evaluate ensures that we won’t get unexpected, frustrating feedback on pieces we’ve already nailed down. Second, they’ll be able to focus immediately on just the essential pieces, saving them a great deal of time and energy they would have otherwise spent assessing a much greater problem.
Next, we have to determine how crucial their feedback is to the momentum of the project; that is, can you continue to make progress without their input? If you believe you can continue to make progress without their opinion, be explicit about it. This way, the engineer can properly prioritize giving you the feedback you need against similar requests they might be juggling from other engineers across the company. If you believe their input is required for your team to continue making progress, letting them know that they are now a blocker should give them adequate urgency to get back to you quickly. Regardless of the urgency, you should set some clear expectations for when you need their feedback so that no one is left twiddling their thumbs.
If you’ve let a senior engineer know that their insights are a blocker, set expectations for when you’d like to have heard from them, and are still waiting for a reply, it’s time to get assertive. If their calendar isn’t flooded with meetings, book some time with them one on one to discuss the item at hand. (Be certain that your meeting description has all the pertinent details!) If you just need a few minutes of their time, try stopping by their desk and seeing whether they’re available to chat, or catch them on their way out of a meeting. It’s much more difficult to put off talking to someone when you’re face to face.
We also need to consider how crucial the senior engineering leader believes their own feedback is to the momentum of the project. If you both agree that their input is not a blocker, great! But if there’s a chance that they’ll be surprised and disgruntled if you move forward without taking their opinion into consideration, you need to be aware of it so that everyone’s expectations are properly aligned.
To illustrate this in action, let’s say you’re working on a prototype for a new library you’re building as part of a large refactor. Your prototype defines some basic interfaces, with a handful of temporary, incomplete implementations. You put up your changes for code review, complete with a short description and links to a design document your team developed. You want some feedback from a senior engineer, so you tag them and a few other teammates for review. Unfortunately, you forget to tell the senior engineer that you’re looking for feedback on the interfaces (not the implementations) and are hoping to merge these changes within the next week.
A few days pass with comments from your teammates but nothing from the senior engineer. You send them a message asking whether they’ve had a chance to take a look at the code review. They assure you that they saw the request and that they intend to get to it by the end of the week. After some back and forth with your teammates, you decide to merge the prototype and continue iterating on it in subsequent code reviews.
A day later, the senior engineer opens up your code review and begins to read through it. They immediately begin commenting on the implementation details, becoming increasingly alarmed as they realize that the code has already been merged. Now everyone’s irritated: you’re irritated that the senior engineer took too long to review your changes and ultimately focused on the wrong aspect of the code; they’re irritated that they left comments on what turned out to be temporary code and that you’ve merged your changes without waiting for their input. All of the disappointment and miscommunication could have been avoided had the right expectations been set from the start.
If there’s only one thing you take away from this chapter, it should be this: there is no single correct communication strategy. Every refactor needs different communication strategies, and these strategies can change throughout the lifetime of the project. The habits you establish should be molded by each of the facets that makes the refactor unique: the team you’ve gathered, the engineering groups affected by the changes, and the level of involvement of external stakeholders.
If at any point you find that your habits are no longer serving you well, shake things up! In the best case, great communication habits can keep your team working effectively at a sustainable, steady pace. In the worst case, bad communication habits can hold your team back and actively prevent the project from moving forward. If something isn’t working, you’re much better off attempting to change it than sticking with habits that could slow you down.
Our next chapter continues with the theme of establishing patterns that help you and your team execute in a productive way. We’ll highlight an assortment of ideas (both technical and nontechnical) your team might want to try throughout the refactor’s development.
Opened in 1904, the New York City subway is among the world’s oldest and most-used public transit systems, serving just under six million riders on an average weekday. Those of us who are intimately familiar with the sprawling network have developed dozens of tiny optimizations that make riding the subway second nature. We listen for announcements of changes in service late at night on a Tuesday. We know the precise force and angle with which to scan our MetroCards through the turnstiles. For newcomers to the city, we can share some of these small but mighty tips to make their first few trips a bit less hectic.
Think of this chapter as the friendly New Yorker giving you advice as you set out to navigate the city’s subway system. It contains a medley of tips for promoting smooth execution throughout a refactor. We’ll first touch on good team-building practices. There are a handful of ways we can go beyond establishing regular communication habits to keep our teammates productive and happy. Next, we’ll cover a few items you should be keeping track of during the refactor to make sure that you’re staying on course and know precisely what to attend to when you’ve reached the final stages of the refactor. Finally, we’ll discuss a few coding strategies to keep sturdy reins on the refactor as you’re implementing it.
In Chapter 6, we examined a few reasons why having a strong team is important within the context of large software projects, including ambitious refactors. We mostly focused on the benefits of having reliable teammates during difficult times (e.g., when the project reaches a mundane stage or hits a new roadblock). What we didn’t mention is that teams that work well together are more creative, learn more from one another, and ultimately solve problems better and faster. To that end, it’s vital for you and your teammates to prioritize regularly participating in team-building activities. The options outlined here are not exhaustive, but I believe that they are some of the most useful habits to develop to strengthen your relationship with your teammates. Once you’ve built up the muscle memory around them, they’ll become second nature and will surely make the refactor fly by smoothly.
结对编程是一种很好的团队建设工具。一起解决问题为参与者提供了一个很好的机会,让他们能够在协作、低风险的环境中了解彼此的优势(和劣势)。如果您的团队还没有太多的合作经验,请考虑鼓励他们在项目开始时结对完成一些任务。尽早开始很重要;新项目不仅为您提供了从一开始就养成良好习惯的独特机会,而且尽早了解队友的能力可以帮助项目顺利起步并继续高效地向前发展。
Pair programming is a great team-building tool. Working on a problem together gives the participants a great opportunity to learn each other’s strengths (and weaknesses) in a collaborative, low-stakes environment. If your team hasn’t had much experience working together yet, consider encouraging them to pair up on a handful of tasks at the onset of the project. Starting early is important; not only does a new project give you the unique opportunity to set good habits from the very start, but understanding your teammates’ abilities early can also help the project start off on the right foot and continue to make forward progress efficiently.
更实际的是,结对编程也是将知识从一个队友传递给另一个队友的好方法。仅凭一己之力理解某个系统一个或多个部分的工程师会对你的项目造成负担,而且往往还会对整个公司造成负担。在许多情况下,这些工程师可能会觉得他们无法请假或完全脱离工作几天,因为他们担心在发生只有他们知道的系统部分的紧急情况时需要他们帮忙。为确保团队中没有一个开发人员是知识孤岛,你可以设置结对会议,以此将他们的专业知识传递给团队中的其他人。如果重构的任何方面出现问题,那么在每个团队成员之间均匀分配知识可以减轻任何单个开发人员的负担。
More practically, pair programming can also be a great way to transfer knowledge from one teammate to another. Engineers who are alone in understanding one or more pieces of a given system are a liability to your project and, not infrequently, your company as a whole. In many cases, these engineers may feel that they are unable to take time off or completely disconnect from work for a few days out of fear that they’ll be needed in the event of an emergency with the part of the system only they know. To ensure that no single developer on your team is a knowledge island, you can set up pairing sessions as a means of transferring their expertise to others on the team. Evenly distributing knowledge across each of your team members lightens the load on any single developer if problems arise with any aspect of the refactor.
结对编程也是调试或解决困难或抽象问题的一种好方法。我们说三个臭皮匠顶一个诸葛亮是有原因的:通过让两个工程师思考同一个问题,你更有可能想出更多不同的解决方案,并更快地找到一个行之有效的解决方案。积极的交流有助于你直面分歧,更有效地改进解决方案。当你一起解决问题时,你犯的错误就会减少;事实上,犹他大学的研究表明,结对编写的代码可以减少大约 15% 的错误。最后,你不太可能分心;因为你们都投入了时间和精力一起解决问题,大声推理问题,所以查看电子邮件或给某人发消息的诱惑就会减少。
Pairing can also be a great way to debug or solve a difficult or abstract problem. We say two heads are better than one for a reason: by having two engineers thinking through the same problem, you’re more likely to come up with a greater variety of solutions, landing on one that works well sooner. The active back and forth helps you address disagreements head-on, refining your solution more effectively. As you navigate through the problem together, you’ll end up making fewer mistakes; in fact, research out of the University of Utah shows that code written in pairs results in about 15 percent fewer bugs. Finally, you’re less likely to get distracted; because you’re both committing the time and energy to solving a problem together, reasoning through the problem out loud, the temptation to check your email or shoot someone a message decreases.
两人一组进行重构特别有效,因为当一个人打字时,另一个人可以更自由地思考大局。重构时,很容易陷入困境,试图理清那些往往令人困惑的遗留代码。你的搭档可以帮助你重新专注于更大的目标,并通过进一步思考问题,指出你在开发过程早期可能遇到的任何陷阱。
Refactoring in pairs can be particularly effective because while one person is typing, the other is freer to think about the bigger picture. When refactoring, it’s easy to get stuck in the weeds trying to untangle what often tends to be confusing, legacy code. Your pair can help you regain focus on the greater goal and, by thinking through the problem a few steps further, point out any pitfalls you may run into earlier in the development process.
然而,结对编程并非没有缺点。当涉及到确定问题范围或学习新知识(例如,使用框架、采用工具、学习编程语言)时,包括我在内的一些工程师更喜欢自己做。我发现,如果我第一次就学会了重要的概念,我就能更好地记住它们。对于定义明确且相对容易解决的问题,结对编程并不是一种特别有效的方法;虽然有一点机会你可以更快地解决任务并产生更少的错误结果,但将两名工程师的时间花在一个简单的任务上并不总是团队资源的最佳利用方式。
Pair programming isn’t without its downsides, however. When it comes to scoping out a problem or learning something new (e.g., using a framework, adopting a tool, learning a programming language), some engineers, including me, prefer to do so on their own. I find that I’m able to retain important concepts better if I stumbled through learning them the first time around. For problems that are well-defined and relatively straightforward to solve, pairing isn’t a particularly productive approach; while there is a slight chance you might solve the task more quickly and produce a less buggy outcome, tying up two engineers’ time on a simple task is not always the best use of resources on your team.
结对编程对二人组来说也是一项耗费精力的工作。需要在持续的一段时间内清晰地表达你的思考过程,这比独自安静地工作、在内心思考问题要耗费更多的精力。在结对编程结束时,你可能需要休息一下,换个方式来补充能量。对于那些不擅长口头交流的开发人员来说,结对编程可能特别具有挑战性,让任何结对编程练习都感觉像是一件苦差事。这就是为什么在提倡结对编程时,要注意团队中每个人的能力和偏好。
Pairing can also be a draining task for the duo. Needing to articulate your thinking process over a sustained period of time takes up quite a bit more energy than quietly working on your own, reasoning through the problem internally. By the end of a pairing session, you might need to take a break and switch gears to recharge. For developers who aren’t great verbal communicators, pairing can be especially challenging, making any pair programming exercise feel like a chore. This is why it’s important to be mindful of the abilities and preferences of everyone on the team when advocating for pairing.
考虑到这些缺点,这里有一些关于如何在团队中建立配对的建议:
With these drawbacks in mind, here are a few recommendations for how to institute pairing on your team:
您的团队中很可能有一些成员是结对编程的忠实支持者,而其他成员则不是。通过强调其好处并强调您对这种做法的支持,您有望说服那些犹豫不决(或从未尝试过)的人尝试一下。(并且希望在尝试之后,他们会渴望重复这一练习。)另一方面,强迫那些不舒服的人结对编程可能是灾难的根源;他们可能会逐渐憎恨团队和项目,从而导致他们寻找出路。
There’s a strong chance that some members of your team are great proponents of pair programming and others are not. By highlighting its benefits and underscoring your support for the practice, you’ll hopefully convince those who are on the fence (or have never tried it before) to give it a go. (And hopefully, after having tried it, they’ll be eager to repeat the exercise.) On the other hand, forcing those who are uncomfortable to pair can be a recipe for disaster; they may grow to resent the team and the project, leading them to seek a way out.
除非你使用结对编程作为一种工具来向初级人员传授特定知识,否则你最好将技能水平相近的工程师结对。在解决难题或调试问题时,水平相近的开发人员不太可能因为对方缺乏经验而感到沮丧。如果你们的技术水平相当,你们就能更有效地交流想法。
Unless you’re using pair programming as a tool to teach something specific to someone more junior, you’re better off pairing like-skilled engineers. When working through a difficult problem or debugging an issue, developers who are at a similar level are less likely to be frustrated by the other’s lack of experience. You’ll more effectively bounce ideas off of one another if you’re at comparable levels in your technical ability.
因为结对编程可能很费力,所以给会话设定一个明确的截止时间(根据需要休息)很重要。从一小时开始,如果到了时间结束,你还有精力(和时间)继续下去,就再延长一小时。给彼此一个结束一天的机会;你不希望结对超出你们双方的能力范围,冒着不必要地降低会话效率的风险。
Because pair programming can be taxing, it’s important to give the session a well-defined cut-off (with breaks as needed). Start with an hour, and if you come to the end of the time and you have the energy (and time) to keep going, extend your session by another hour. Give each other an opportunity to call it a day; you don’t want the pairing to stretch beyond either of your capacities and risk needlessly decreasing the efficiency of the session.
在第 4 章中,我们讨论了如何制定一个有重点、适当平衡的执行计划,为团队提供足够的灵活性,以防止精疲力竭。我们可以花时间表彰我们的队友并庆祝我们一路取得的成就,从而进一步确保我们的团队在整个长期大规模重构过程中保持积极性。您的团队不需要花费大量预算购买品牌马克杯或参加令人垂涎的场外活动来建立团队间有意义的联系或突出团队的贡献。有许多简单但有效的方法可以保持每个人的士气。
In Chapter 4, we discussed building a focused, properly balanced execution plan that gives the team enough flexibility to prevent exhaustion. We can further ensure that our teams stay motivated throughout a long at-scale refactor by taking the time to recognize our teammates and celebrate our achievements along the way. Your team doesn’t need a massive budget for branded mugs or access to coveted off-site activities to build meaningful connections across the team or highlight the group’s contributions. There are a number of simple but effective ways to keep everyone’s morale up.
首先,我们将考虑如何保持个人积极性。我们可以提高队友动力的更有说服力的方法之一是让他们有机会以最佳方式利用他们独特的技能和能力为重构做出贡献。如果您的队友正在从事他们认为最有价值的重构部分,他们会更快乐(并且可能更有效率)。如果您的队友正在寻找成长的机会,无论是通过开发新的技术技能还是通过监督项目中更重要的部分,请尽力为他们提供这些机会。还记得您如何向这位队友提出加入您工作的想法,在第6 章中为他们提供更大的知名度或责任(甚至可能是晋升)的机会。
First, we’ll consider how we can keep individuals motivated. One of the more compelling ways we can boost a teammate’s drive is by giving them the opportunity to contribute to the refactor in a way that best leverages their unique skills and abilities. Your teammates will be much happier (and likely more productive) if they are working on pieces of the refactor that they find to be the most rewarding. If your teammates are looking for opportunities to grow, whether by developing a new technical skill or by overseeing a more significant portion of the project, do your best to make these opportunities available to them. Remember how you pitched this teammate on joining your effort in Chapter 6: by offering them the opportunity for greater visibility or responsibility (and perhaps even a promotion).
如果可能的话,请让您的队友灵活地选择工作时间、地点和方式。并非每个人都适合每天从上午 9 点工作到下午 5 点,中午有半小时的午餐时间。有些人可能更喜欢黎明时分进入办公室,下午早些时候出门。其他人可能只在上午 10 点登录,下午 10 点接孩子,晚饭后收工。如果您可以适应队友的各种日程安排,同时继续保持良好的沟通习惯(参见第 7 章),他们不仅会感激,而且总体上可能会更有效率!
If possible, give your teammates the flexibility to choose when, where, and how they work. Not everyone is cut out to work from 9 a.m. to 5 p.m. with a half hour for lunch at noon every day. Some might prefer to come in to the office at the crack of dawn and head out in the early afternoon. Others might only log on midmorning, pick up their children midafternoon, and wrap up after dinner. If you can accommodate your teammates’ assorted schedules while continuing to maintain good communication practices (see Chapter 7), they will not only be thankful, but likely even more productive overall!
表彰团队成员的独特贡献是激励他们的绝佳方式。通过向他们表明你和团队其他成员欣赏他们的辛勤工作,你重申了他们正在做正确的事情,鼓励他们继续前进,并培养团队的归属感。表彰可以采取任何形式:可以通过正式的部门或公司范围的计划,也可以像手写便条一样简单。注意你的队友喜欢的表彰方式。虽然有些人喜欢在全体会议上听到自己的名字被点名,但有些人却不愿公开表扬。以错误的形式表彰,在最好的情况下也不是很有效,在最坏的情况下则完全失败。有时,一封深思熟虑的电子邮件或热情洋溢的同行评审就足够了。
Recognizing individual teammates for their distinct contributions is a great way to keep them motivated. By showing them that you and the rest of your team appreciate their hard work, you’re reaffirming that they are doing the right thing, encouraging them to keep going, and fostering a sense of belonging on the team. Recognition can take just about any shape: it can be through a formalized department- or company-wide program, or can be as simple as crafting a handwritten note. Be mindful of your teammates’ preferred way of being recognized. Although some enjoy hearing their name called out at an all-hands, others shy away from public praise. Recognition in the wrong form is at best not very effective, and at worst a total fiasco. Sometimes, a thoughtful email or glowing peer review is more than enough.
您的经理可以为您提供宝贵的资源,帮助您找到认可整个团队的方法。(如果您希望获得计划的预算,您可能需要他们的支持。)话虽如此,让团队认可同事具有独特的价值。
Your manager can be a great asset for helping you set up ways to recognize your team as a whole. (You’ll probably need their support if you’re hoping to get a budget for whatever you’re planning to put together.) That said, there is unique value in having the team recognize its peers.
例如,您可以制定一个轻松的“每周获胜者”传统。首先,团队将获得一个小奖杯(或任何从队友的办公桌上清晰可见的物品),并选择某人以表彰其在上周所做的出色工作。这可以是任何事情,从介入帮助解决棘手的错误,到为给定的补丁撰写出色的描述。下一周,获胜者将选择下一个获胜者,并将奖杯传给下一个获胜者。这个传统一直持续到项目结束或团队选择退役为止。
You could, for example, put together a lightweight “Win of the Week” tradition. To kick it off, the team acquires a small trophy (or any item clearly visible from a teammate’s desk) and chooses someone to recognize for excellent work done over the previous week. This could be anything from stepping in to help resolve a tricky bug to crafting a great description for a given patch. The following week, the winner chooses the next winner, passing on the trophy. The tradition continues until the project wraps up or until the team chooses to retire it.
接下来,我们将介绍一些有助于保持整个团队积极性的有用方法。让每个人都对出色工作感到兴奋的一种几乎万无一失的方法是将其变成游戏。通过将重构中较为平凡的部分游戏化,您可能会发现您的队友渴望完成任务并更快地朝着里程碑前进。一个简单的宾果游戏就是一个很好的例子。确定您的团队在重构的当前里程碑期间可以做出的细小但重要的贡献,并将其放入宾果游戏表生成工具中。这些可以简单到与某人结对解决一个难题或完成 10 次代码审查。您可以打印出板子并将其分发给您的团队,并为获胜者提供小奖品。
Next, we’ll take a look at helpful methods for keeping your team motivated as a whole. A near foolproof way to get everyone excited about doing great work is to turn it into a game. By gamifying the more mundane portions of the refactor, you may find your teammates eager to complete tasks and progressing toward milestones more quickly. A good example might be a simple game of Bingo. Identify small but important contributions your team can make during the refactor’s current milestone and plop them into a Bingo game sheet generation tool. These can be as simple as pairing with someone on a difficult problem or completing 10 code reviews. You can print out the boards and distribute them to your team and offer a small prize for winners.
在将任意数量的任务游戏化时,请注意不要激起过多的竞争。虽然竞争可以成为一种很好的激励因素,但如果失控,您就会冒着引发冲突、士气和团队合作恶化的风险。故意将团队合作的某些方面融入游戏;这将鼓励每个人与周围的人一起努力,进一步巩固您的团队。在大规模重构中,几乎没有(如果有的话)允许草率执行的空间,因此您还需要注意主要激励工作的质量而不是完成度。如果您强调到达终点线,您的队友可能会为了更快到达终点而走捷径。
When gamifying any number of tasks, be mindful not to incite too much competition. While it can be a great motivator, if it gets out of hand you’ll risk sparking conflict and seeing morale and teamwork deteriorate. Incorporate aspects of teamwork into the game deliberately; this will encourage everyone to pull up those around them and further solidify your team. With a large-scale refactor, there is very little room (if any) for sloppy execution, so you’ll also want to be careful to chiefly incentivize the quality of the work rather than its completion. If you put emphasis on reaching the finish line, your teammates might cut some corners in an attempt to get there faster.
在规划大型项目里程碑中较小子任务的估算时,请考虑将部分流程游戏化。让团队的每个成员提交他们对您何时达到目标指标的最佳猜测,遵循“价格合理”规则(即最接近但不超过)。当您达到指标时,在下次团队会议上用鼓声揭晓获胜者。每个人都会因为试图一针见血而感到兴奋,您的估算可能会随着时间的推移变得更好!
When planning estimates for smaller subtasks within larger project milestones, consider gamifying part of the process. Have each member of the team submit their best guess of when you’ll hit a target metric, following The Price is Right rules (i.e., closest without going over). When you reach the metric, recognize the winner with a drumroll reveal at your next team meeting. Everyone will get a kick out of trying to hit the nail on the head and your estimates might get better over time!
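The “closest without going over” rule is easy to get subtly wrong when tallying guesses by hand. Here is a minimal sketch of it, assuming each teammate guesses the date on which the target metric will be hit; the names, dates, and the pick_winner helper are made up for illustration.

```python
from datetime import date
from typing import Dict, Optional


def pick_winner(guesses: Dict[str, date], actual: date) -> Optional[str]:
    """Return the teammate whose guess is closest to `actual` without going over.

    A guess "goes over" when it falls after the date the metric was actually hit.
    Returns None if every guess went over. Ties go to whoever was entered first.
    """
    eligible = {name: guess for name, guess in guesses.items() if guess <= actual}
    if not eligible:
        return None
    # The latest eligible guess is the closest one that did not go over.
    return max(eligible, key=eligible.get)


# Hypothetical guesses for when the milestone metric will be reached.
guesses = {
    "Ana": date(2021, 3, 1),
    "Ben": date(2021, 3, 12),
    "Chi": date(2021, 3, 20),
}
winner = pick_winner(guesses, actual=date(2021, 3, 15))
print(winner)
```

Chi’s guess is the nearest in absolute terms, but it falls after the actual date, so it is disqualified under this rule.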
最后,记得在整个项目期间举行一两次聚会来庆祝团队的成就,特别是在完成重要里程碑之后。庆祝的时刻有助于创造持续的参与度并保持良好的士气。如果团队从来没有机会停下来纪念彼此的努力,你的重构就会开始感觉像一场无休止的老鼠赛跑。抽出一些时间以最有效的方式把大家聚在一起,无论是团队聚餐还是午后咖啡吐司。你们都会感激自己花了一点时间来反思自己的成就。
Finally, remember to celebrate your team’s achievements with a gathering or two speckled throughout the project, particularly after concluding significant milestones. Moments of celebration help create sustained engagement and maintain good morale. If the team never has the opportunity to hit pause and commemorate each other’s efforts, your refactor will begin to feel like an endless rat race. Carve out some time to bring everyone together whichever way works best, whether that’s a team potluck lunch or a midafternoon coffee toast. You’ll all be thankful to have taken a moment to reflect on your accomplishments.
在执行重构时,经常检查进度并持续记录重要发现非常重要。通过经常测量和反思,您将更加确信项目正朝着正确的方向发展,并降低团队在重构的最后阶段忘记重要事项的可能性。务必使用项目中期更新继续更新“执行计划”中讨论的执行计划的现行版本。
As you’re executing your refactor, it’s important to check on your progress frequently and maintain a running tally of important findings. By measuring and reflecting often, you’ll be more confident that the project is headed in the right direction and decrease the likelihood that your team forgets something important in the final stages of the refactor. Be certain to continue to update the living version of your execution plan, discussed in “Execution plan”, with your midproject updates.
在第 3 章中,我们研究了多种不同的方法来描述我们想要通过重构解决的问题。后来,我们使用这些指标来指导我们的执行计划,并进一步将项目分解为单独的里程碑,每个里程碑都有自己的一套指标。我们不应该在积极执行重构时忽视这些目标,否则可能会偏离轨道。每一个雄心勃勃的软件项目,都存在着一个重大而危险的范围蔓延机会。
In Chapter 3, we examined a number of distinct ways to characterize the problems we aim to fix with our refactor. We later used those metrics to inform our execution plan, and further broke down the project into individual milestones, each with its own set of metrics. We shouldn’t lose sight of these goals while actively executing the refactor, or we risk veering off course. With every ambitious software project, there is a significant and dangerous opportunity for scope creep at every turn.
通过每周(或每两周)衡量团队在每个中级指标上的进展,你们就要对推动你们认为最重要的目标的实现负责。通过频繁的检查,团队不太可能受到诱惑而开始任何无关紧要的支线任务,从而可以扩大项目范围。定期检查还能让你评估速度。如果每个人都专注于正确的任务,但连续几周指标都没有什么积极的变化,那么显然出了问题。也许团队正在努力取得实质性进展,因为它继续遇到许多棘手的错误,或者这些指标并不是传达团队贡献的理想选择。无论潜在的困境是什么,当你再次开始注意到指标的良好变化时,你就会知道你成功地解决了它。
By measuring the team’s progress toward each intermediate metric on a weekly (or biweekly) basis, you are holding yourselves responsible for moving the needle forward on the goals you’ve identified as the most important. With frequent check-ins, the team is less likely to give in to the temptation to embark on tangential side quests that would expand the project’s scope. Periodic check-ins also give you the ability to assess your velocity. If everyone is focused on the right tasks, but there is little positive change in the metrics for several weeks in a row, something is clearly amiss. Perhaps the team is struggling to make substantial progress because it continues to encounter a number of difficult bugs, or the metrics are not ideal candidates for conveying your team’s contributions. Whatever the underlying dilemma, you’ll know you successfully solved it when you begin to notice positive change in your metrics once more.
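One way to make the “several weeks in a row” check concrete is to compute week-over-week changes and flag a stall. This is a rough sketch rather than a prescribed process: the metric (call sites left to migrate), the three-week window, and the threshold of five are all assumptions chosen for illustration.

```python
from typing import List


def weekly_deltas(readings: List[float]) -> List[float]:
    """Week-over-week change for a metric sampled once per check-in."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


def is_stalled(readings: List[float], weeks: int = 3, min_change: float = 5.0) -> bool:
    """True if each of the last `weeks` check-ins improved the metric by less than `min_change`.

    Assumes the metric should be going *down* (e.g., call sites left to migrate),
    so improvement is the negated delta.
    """
    recent = weekly_deltas(readings)[-weeks:]
    return len(recent) == weeks and all(-delta < min_change for delta in recent)


# Hypothetical data: call sites of the old function still left to migrate.
remaining_callsites = [300, 262, 231, 230, 230, 230]
stalled = is_stalled(remaining_callsites)
print(stalled)
```

A flag like this only tells you that the number stopped moving; deciding whether the cause is a run of difficult bugs or a poorly chosen metric still takes a conversation with the team.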
无论您的重构是否出于发现和修复系统性错误的动机,您在整个过程中都必然会遇到一些缺陷。对于每个错误,无论您决定如何处理它(修复或不修复),您都应该记录它在项目中被发现的时间、它出现的条件(以便于重现)以及采取了哪些措施。在重构的上下文中面对错误时,通常有两种选择;第一种是修复错误,另一种是重新实现它。
Regardless of whether your refactor is motivated by the desire to surface and fix systemic bugs, you are bound to encounter a handful of defects throughout the endeavor. For each bug, no matter what you decide to do about it (fix it or not), you should document when in the project it was uncovered, the conditions under which it arises (for easy reproduction), and what actions were taken as a result. There are typically two options when confronting a bug within the context of a refactor; the first is to fix the bug, and the other is to reimplement it.
考虑一下您的团队修复错误的情况。如果重构后修复过程简单而干净,那么有一个可以快速参考的示例来证明其有效性,可以方便地向利益相关者展示或与同事分享。有时,仅仅一两个棘手的、记录良好的错误就可以说服任何最初对重构持观望态度的人,重构是值得的。另一方面,如果您的团队将错误移植到重构中,您需要确切地知道在哪里找到它以及如何重现它以修复它或将其交给适当的团队进行修补。
Consider the case in which your team fixes the bug. If the fix is easy and clean as a result of the refactor, you have a convenient example to show stakeholders or share with peers as a demonstration of the refactor’s efficacy. Sometimes, just one or two thorny, well-documented bugs can convince anyone who was initially on the fence about the refactor that it is well worthwhile. On the other hand, if your team ports the bug into the refactor, you’ll need to know precisely where to find it and how to reproduce it either to fix it or to hand it off to the appropriate team to patch.
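The details worth recording for each bug (when it was uncovered, how to reproduce it, and what was done about it) map naturally onto a small structured record. The sketch below is one possible shape; the field names, the Resolution values, and the sample entries are hypothetical, and a spreadsheet or issue tracker would serve just as well.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List


class Resolution(Enum):
    FIXED = "fixed"                  # corrected as part of the refactor
    REIMPLEMENTED = "reimplemented"  # deliberately carried over into the new code


@dataclass
class BugRecord:
    title: str
    found_on: date          # when in the project it was uncovered
    milestone: str          # which milestone was in progress at the time
    repro_steps: str        # conditions under which it arises
    resolution: Resolution  # what action was taken
    notes: str = ""         # e.g., which team it was handed to


def ported_bugs(records: List[BugRecord]) -> List[BugRecord]:
    """Bugs that were carried into the refactored code and still need a fix or a handoff."""
    return [r for r in records if r.resolution is Resolution.REIMPLEMENTED]


# Hypothetical entries for illustration.
log = [
    BugRecord("Totals ignore refunds", date(2021, 2, 3), "M1",
              "Create an order, refund one item, view the summary", Resolution.FIXED),
    BugRecord("Timezone off by one day", date(2021, 2, 17), "M2",
              "Schedule an event at 23:30 UTC-8", Resolution.REIMPLEMENTED,
              notes="Handed to the calendar team"),
]
print([r.title for r in ported_bugs(log)])
```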
在“清理工件”中,我们讨论了在执行计划中包括一个用于清理重构期间产生的工件的独特阶段的重要性。每次重构都应优先考虑让代码库保持有序状态,以便其他开发人员使用;毕竟,通常大规模重构的主要动机是改善代码库的人体工程学。虽然在编写第一行代码之前,我们可能对整个项目将生成的工件类型有一个适度的直觉,但毫无疑问,我们会在运行中创建各种各样的工件。
In “Cleaning Up Artifacts”, we looked at the importance of including a distinct phase in our execution plan for cleaning up artifacts produced during the refactor. Every refactor should prioritize leaving the codebase in an orderly state for other developers; after all, usually a substantial motivation for a large refactor is to improve the ergonomics of your codebase. While we might have a modest intuition about the kinds of artifacts we’ll be generating throughout the project well before we write our first line of code, there will undoubtedly be an assortment of them we create on the fly.
跟踪所有需要整理的内容,无论您是计划在当前里程碑结束时还是仅在项目的最后阶段处理混乱情况。在使一段代码过时时立即更新列表至关重要;这样,一旦进入清理阶段,您就一定会删除每个相关工件。与新重构代码交互的工程师将感谢有序的体验。
Keep track of everything that’ll need tidying, whether you plan to tackle the clutter at the end of your current milestone or only in the final stages of the project. Updating your list immediately as you render a section of code obsolete is critical; this way, you’ll be certain to remove each relevant artifact once you’ve reached the cleanup phase. The engineers who interface with the newly refactored code will be grateful for an orderly experience.
就像厨师建议在做饭时清洗锅碗瓢盆一样,我建议在重构过程中不断进行清理。在使代码片段变得不再必要后立即删除它们要容易得多(也更安全)。在这个阶段,新过时的代码与重构的其余部分之间的无数交互还历历在目,您在将其清除时犯的错误更少。
Just as a cook would recommend cleaning pots and pans as you use them when preparing a meal, I recommend continually cleaning up as a refactor progresses. It is far easier (and safer) to remove pieces of code soon after rendering them unnecessary. At this stage, the myriad interactions between the newly obsolete code and the remainder of the refactor are fresh in your mind, and you are less likely to make mistakes extricating it.
几乎团队中的每个工程师都会在重构的生命周期中遇到一些扩大重构范围的机会。显然,如果每个人都抵制住这种诱惑,您的项目将有更好的机会在重要的截止日期前完成,但这些机会的扩展不应被完全忽略。考虑列出您遇到的扩展项目的机会。拥有一套简洁的衍生项目可以展示重构的多功能性;如果有大量不同的方法可以利用项目的势头来继续改进代码库,您的利益相关者(以及同事)将更有可能相信重构是一项有价值的努力。如果您自己的团队(或公司中的任何其他团队)希望在重构建立的基础上继续在完成后对代码库进行渐进式改进,他们可以从此列表中确定几个项目并立即启动它们。
Nearly every engineer on your team will encounter a few opportunities to add scope to the refactor during its lifetime. Obviously, your project will have a better chance at hitting its important deadlines if everyone resists the temptation, but these opportune extensions should not be outright ignored. Consider keeping a list of the opportunities you encounter to expand on the project. Having a succinct set of spin-off projects can demonstrate the versatility of your refactor; if there is a broad number of distinct ways to capitalize on the project’s momentum to continue to improve the codebase, your stakeholders (and peers alike) will be more likely to believe the refactor was a valuable endeavor. If your own team (or any other team at the company) wants to build upon the foundation established by the refactor and continue making incremental improvements to the codebase following its completion, they could scope out a few projects from this list and kick them off immediately.
您可以采用一些有用的策略,让漫长的重构过程对您自己和团队成员来说都更加愉快。大型软件项目的开发并不总是很棘手;事实上,在编写全新的东西时,可能只有少数困难的操作,其中大多数操作只有在将功能嵌入现有代码库时才有必要。另一方面,当需要为重构编写大量代码时,其中大部分是现有行为的副本,需要精心设计并与原始实现巧妙集成。艰苦的过程失败的机会要大得多。希望您可以学会通过遵循本节中描述的技术成功驾驭重构开发过程。
There is a handful of useful strategies you can adopt to make a lengthy refactor much more pleasant for both yourself and your team members. Large software projects are not always tricky to develop; in fact, when writing something entirely new, there might be only a handful of difficult maneuvers, most of which are necessary only when embedding the feature into the existing codebase. On the other hand, when a significant amount of code needs to be written for a refactor, the majority of it a copy of existing behavior, it needs to be carefully designed and delicately integrated with its original implementation. There are considerably more opportunities for the painstaking process to fail. Hopefully, you can learn to navigate the refactoring development process successfully by following the techniques described in this section.
当我们在第 4 章中着手起草重构计划时,我们的目标是达到适当的详细程度。我们希望该计划对那些可能不熟悉技术细节的重要利益相关者来说易于理解,但又足够具体,以便我们能够正确地向团队通报项目情况并开始毫不含糊地执行。如果计划故意保持模糊,那么这正是进行原型设计的绝佳机会。
When we set out to draft a plan for our refactor in Chapter 4, we aimed to strike the right level of detail. We wanted the plan to be approachable for important stakeholders who might not be intimately familiar with the technical details, but sufficiently specific that we could properly inform a team about the project and begin execution without ambiguity. Where the plan remained deliberately vague is a perfect opportunity for prototyping.
如果您遵循两个重要原则,尽早并经常进行原型设计可以帮助您的团队更快地行动:
Prototyping early and often helps your team ultimately move faster if you abide by two important principles:
专注于制定一个总体上行之有效的解决方案,注意不要花太多时间完善细节。请记住,即使我们花了几个小时尝试设计理想的解决方案,未来需求的变化也可能使它过时。(我们在第 2 章中看到了一些具体的例子。)一个好的解决方案是能够很好地解决最重要的问题,并允许相当大的灵活性。
Focus on crafting a solution that works well overall, being mindful about not spending too much time perfecting the details. Remember that even if we spent hours attempting to devise the ideal solution, a future change in requirements might render it obsolete. (We saw a few concrete examples of this in Chapter 2.) A great solution is one that solves the most important problems well and allows for a fair amount of flexibility down the line.
如果我们花了一两周时间编写的解决方案根本无法实现,那么就把有用的部分拿走,把其余部分扔掉,然后重新开始。原型设计就是尝试一些东西,从经验中学习,然后重新开始。
If we spend a week or two writing a solution that simply doesn’t deliver, take the pieces that work, throw the rest away, and start again. Prototyping is all about trying something, learning from that experience, and starting again.
让我们考虑一下重构,您的团队希望将一个臃肿的类拆分成几个不同的组件。您的团队提出了一个初步设计,将其主要职责划分为三个新类,但还有许多次要但重要的职责尚未分配给其中任何一个。您没有在流程早期全心全意地致力于解决方案,而是决定对几个选项进行原型设计,在代码库的几个说明性部分中尝试新类的人体工程学。有了原型,您的团队就可以决定什么可行,什么不可行,并制定出一个可以与代码库的其余部分很好地集成的解决方案。
Let’s consider a refactor in which your team wants to split up a bloated class into a few distinct components. Your team came up with a preliminary design that divides its primary responsibilities into three new classes, but there are a number of minor, albeit important, responsibilities that have yet to be assigned to any one of them. Instead of committing wholeheartedly to a solution early in the process, you decide to prototype a few options, trying out the ergonomics of the new classes in just a few illustrative sections of the codebase. Given the prototypes, your team is able to decide what works and what doesn’t, and iron out a solution that should integrate well with the remainder of the codebase.
当对大面积区域进行全面更改时,很容易忘乎所以。例如,我们需要将一个函数 pre_refactor_impl 的所有调用点迁移到新函数 post_refactor_impl。整个代码库中大约有 300 个 pre_refactor_impl 实例,跨越 80 多个文件。您可以进行简单的查找和替换,将更改集中到单个提交中,然后将补丁提交给团队成员进行审查。如果迁移相当简单,虽然只创建一组更改可能看起来更方便,但也存在一些严重的缺点。
When making sweeping changes across a large surface area, it’s easy to get carried away. Say, for instance, we need to migrate all callsites of one function, pre_refactor_impl, to a new one, post_refactor_impl. There are about 300 instances of pre_refactor_impl throughout the codebase, spanning just over 80 files. You could do a simple find and replace, lump the changes into a single commit, and put the patch up for review by a teammate. Even if the migration is fairly straightforward and a single set of changes appears more convenient, this approach has a few severe disadvantages.
首先,提交小规模的增量更改可以更轻松地编写出色的代码。通过推送小规模的提交,您可以尽早从工具中获得相关反馈(例如,通过持续集成在服务器上运行集成测试)。如果您不频繁地推送大范围的更改,则可能需要费力地解决大量测试失败问题。每次提交的修改越多,级联测试失败的可能性就越大;修复一个错误只会揭示另一个错误。保持严格的提交最终可以让您更好地了解它们的影响并更快地修复任何失败的测试。手动验证更改时也是如此。
First, committing small, incremental changes makes it much easier to author great code. By pushing bite-sized commits, you can get relevant feedback early and often from your tooling (e.g., integration tests running on a server through continuous integration). If you push a wide breadth of changes infrequently, you risk needing to wade through and fix a heap of test failures. More modifications per commit leads to a greater likelihood of cascading test failures; fixing one error only reveals another. Keeping tight commits ultimately enables you to understand their impact better and fix any failing tests faster. The same applies when manually verifying changes.
其次,还原小提交比还原大提交容易得多。如果出现问题,无论是在开发过程中还是在代码部署之后,还原小提交都可以让您小心地仅提取有问题的更改。
Second, reverting a small commit is much easier than reverting a big one. If something goes wrong, whether during development or well after the code has been deployed, reverting a small commit allows you to carefully extract only the offending change.
第三,由于简洁的提交往往足够集中,您还可以编写更好、更精确的提交消息。有了更好的提交消息,您不仅能够更快地找到一组特定的更改,而且您的队友在以后浏览版本历史记录时也会更好地理解它们。(小提交通常也会得到更快的审查和批准!)
Third, because concise commits tend to be sufficiently focused, you’ll also be able to write better, more precise commit messages. With better commit messages, not only will you be able to locate a specific set of changes faster, your teammates will understand them better when scanning through the version history at a later date. (Tiny commits typically get reviewed and approved much, much faster, too!)
最后,团队成员几乎不可能充分审查修改后的全部代码。尽管组织不应依赖代码审查来发现错误(而应依赖全面而认真的测试),但如果测试覆盖率不足,发现潜在错误的负担就会落在审查人员身上。从表面上看,这些更改似乎很容易验证,但在审核了其中的几个之后,除非我们保持坚定的专注,否则我们发现差异的能力就会减弱。如果将大型变更集拆分为逻辑组织、简洁的提交,则审查起来要容易得多。
Finally, it is nearly impossible for a teammate to review the entirety of the modified code adequately. Although organizations should not rely on code review to catch bugs (relying instead on thorough and earnest testing), if there is insufficient test coverage, the burden of catching potential mistakes falls to the reviewer. Superficially, the changes may seem easy to verify, but after auditing just a few of them, unless we retain a steadfast focus, our ability to spot discrepancies wanes. Large changesets are far easier to review if split up into logically organized, pithy commits.
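To illustrate the alternative to one giant find-and-replace, here is a minimal sketch that plans the pre_refactor_impl migration as a series of small batches, each intended to become its own commit. Grouping by top-level directory and capping each batch at five files are assumptions made for the example, as are the file paths; the right grouping depends on how your codebase and code ownership are organized.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Word boundaries keep longer names such as `my_pre_refactor_impl_helper` untouched.
OLD_CALL = re.compile(r"\bpre_refactor_impl\b")


def migrate_source(source: str) -> str:
    """Rewrite references to the old function in one file's text."""
    return OLD_CALL.sub("post_refactor_impl", source)


def plan_batches(paths: Iterable[str], max_files: int = 5) -> List[List[str]]:
    """Group files by top-level directory, then split each group into chunks of
    at most `max_files`, so each commit stays small and touches one area."""
    by_dir: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(paths):
        by_dir[PurePosixPath(path).parts[0]].append(path)
    batches: List[List[str]] = []
    for files in by_dir.values():
        for start in range(0, len(files), max_files):
            batches.append(files[start:start + max_files])
    return batches


# Hypothetical files containing call sites of the old function.
files = ["billing/a.py", "billing/b.py", "billing/c.py", "search/x.py"]
for batch in plan_batches(files, max_files=2):
    print(batch)
```

Each batch can then be rewritten, run through the relevant tests, committed, and put up for review on its own, which keeps any revert limited to a handful of files.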
重构时,您希望尽可能保留原始版本历史记录。考虑使用诸如git mv移动文件之类的操作,而不是删除它们并重新添加它们。在提交描述中明确说明更改是更大规模重构的一部分,以便工程师在寻找潜在代码所有者时知道要深入挖掘提交历史记录。在为审查代码的队友撰写描述时,请做一位体贴的队友。撰写详尽的描述,概述审查人员应该在变更集中找到什么,以及任何必要的上下文。
When refactoring, you want to maintain the original version history as much as possible. Consider using operations like git mv to move files around rather than deleting them and adding them back. Make it clear in your commit descriptions that the change is part of a larger refactor, so that engineers know to dig deeper into the commit history when looking for a potential code owner. Be a thoughtful teammate when writing descriptions for the teammates reviewing your code. Write a thorough description, outlining what the reviewer should expect to find in the changeset, along with any necessary context.
由于重构涉及逐步重新实现现有行为,因此我们需要确定更改不会改变预期行为。在实践中,验证没有任何变化通常比验证相反情况要困难得多,因此在重构时进行增量和重复测试尤为重要。通过频繁重新运行单元测试、集成测试或进行手动测试,我们可以确认所有内容均未受到影响,或者可以精确地找出行为发生分歧的确切时刻。
Because refactors involve gradually reimplementing existing behavior, we need to ascertain that the changes are not modifying the intended behavior. In practice, it is typically much more difficult to verify that nothing has changed than the opposite, making it particularly important to test incrementally and repeatedly when refactoring. By frequently rerunning unit tests, integration tests, or walking through manual tests, we can either confirm that everything has remained unaffected or pinpoint the precise moment at which the behavior diverged.
在开始修改任何代码段之前,请验证它是否有简洁、独特的单元测试。可能已经有一些测试来断言行为,但您应该花时间确定是否缺少任何其他案例。如果测试太粗略(例如,只测试顶级函数的流程,而没有对任何单个辅助函数进行任何测试),请将它们拆分。细粒度测试,就像细粒度提交一样,将帮助您尽早缩小问题范围。
Before you begin modifying any section of code, verify that there are neat, distinct unit tests for it. There might already be a handful of tests to assert the behavior, but you should take the time to determine whether any additional cases are missing. If the tests are too coarse (e.g., only testing the flow for a top-level function, without any tests for any of the individual helper functions), split them up. Granular tests, just like granular commits, will help you narrow down issues early.
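One common way to check that nothing has changed is to run the old and new implementations side by side over the same inputs and report any disagreement. The sketch below assumes both versions can coexist during the migration; the bodies of pre_refactor_impl and post_refactor_impl are stand-ins invented for the example, since the real functions are not shown here.

```python
def pre_refactor_impl(items):
    """Stand-in for the legacy function (hypothetical behavior: sum, skipping None)."""
    total = 0
    for item in items:
        if item is not None:
            total = total + item
    return total


def post_refactor_impl(items):
    """Stand-in for the refactored replacement."""
    return sum(item for item in items if item is not None)


# Small, distinct cases: empty input, the common path, and the None edge cases.
CASES = [[], [1, 2, 3], [None, 4], [0, -5, None, 5]]


def find_divergences(old, new, cases):
    """Return the inputs for which the two implementations disagree."""
    return [case for case in cases if old(case) != new(case)]


divergences = find_divergences(pre_refactor_impl, post_refactor_impl, CASES)
print(divergences)
```

Keeping each case small and focused mirrors the advice about granular tests: when a case does show up in the list of divergences, it points at one specific behavior rather than an entire flow.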
我们都参加过这样的会议:我们和一群高级工程师坐在一起,讨论一项我们不太了解的技术或产品功能。一开始,似乎每个人都在跟着讨论,点头表示同意,只有少数人主导讨论。我们感到困惑,但我们太担心自己看起来没有准备好提出任何澄清问题。这种会议通常最终会走向两个方向。第一种是我们继续安静地坐着,在会议的剩余时间里试图把所有事情拼凑起来,无法对谈话做出有意义的贡献。第二种是其他人插话,礼貌地问出我们不好意思问的那个问题。我们感谢队友的好奇心(感谢我们并不孤单),我们很快就能和其他人一起回到正轨。
We’ve all been in that meeting: the meeting where we sit with a bunch of senior engineers, talking about a technology or a product feature we don’t understand very well. At first, it seems as though everyone’s following along, nodding as a select few lead the discussion. We’re confused, but we’re too worried that we’ll look unprepared to ask any clarifying questions. There are two directions this meeting usually ends up taking. The first is the one in which we continue to sit quietly and spend the rest of the meeting trying to piece everything together, unable to contribute meaningfully to the conversation. The second is the one in which someone else interjects, politely asking the very same question we were too embarrassed to ask. We’re thankful for our teammate’s curiosity (thankful we weren’t alone), and we’re able to get back on track with everyone else pretty quickly.
我们不能总是指望好奇的队友有同样的问题,我们也不应该满足于浪费时间坐在会议上或阅读电子邮件,继续思考正在讨论的内容。因此,我建议第三个方向,你站起来,直接问“愚蠢”的问题。通过优先考虑清晰度而不是保持全知的幻觉,你正在为你的团队树立重要的行为榜样。你确认事实上没有一个问题是愚蠢的,最重要的是确保每个人都在同一立场上。你将进行更有成效的讨论,减少误解,并更快地着手解决正确的问题。
We can’t always count on our inquisitive teammates to have the same questions, nor should we be content to waste time sitting in a meeting or reading an email thread, continuing to wonder what is being discussed. So, I propose a third direction, in which you stand up and simply ask the “stupid” question. By prioritizing clarity over maintaining an illusion of omniscience, you are modeling important behavior for your team. You’re affirming that no question is, in fact, a stupid question, and that above all else, it’s important to make sure that everyone is on the same page. You’ll have more productive discussions and fewer misunderstandings, and get to work solving the right problems more quickly.
在大规模重构某些内容时,由于更改的范围可能非常广,因此您很有可能会接触到您不熟悉的代码库部分。勇敢地寻求这些领域的专家并寻求指导至关重要。无论您需要简短的解释还是更深入的演练,对要修改的代码建立牢固的理解都是必不可少的。您不仅可以节省开发时间,在重构时引入更少的错误,还可以获得以最适合代码的方式重构它所需的洞察力。
When refactoring something at scale, because the surface area of the changes can be quite vast, there is a distinct chance that you will come in contact with portions of the codebase you’re unfamiliar with. Being unafraid to seek out the experts in these areas and ask for guidance is crucial. Whether you need a short explanation or a more in-depth walkthrough, it’s imperative to build a strong understanding of the code you’re seeking to modify. Not only will you save on development time, and introduce fewer bugs as you refactor it, you’ll also have the insight necessary to refactor it in a way that best suits the code.
一旦你完成了最后几次提交并整理好了一切,你就可以开始执行最后一项重要任务了。你需要找到方法让你的所有努力长期持续下去。我们的下一章将介绍你的团队可以采取的一些重要步骤,以确保你的代码库不会慢慢退回到以前的状态。
Once you’ve pushed the final few commits and tidied everything up, you’re ready to take on one last, vital task. You need to find ways to make all your efforts persist long-term. Our next chapter will take a look at a few important steps your team can take to ensure that your codebase does not slowly regress to its previous state.
一年多前,我的朋友蒂姆决定完全停止食用糖,以帮助他减掉几磅讨厌的体重并恢复更多精力。第一周很艰难;他感到昏昏欲睡,渴望任何甜食,但到第三周结束时,对糖的戒断反应已经减轻,他又开始感到精力充沛。不久之后,新饮食的好处开始显现:他在整个工作日都感觉更加警觉,并且减掉了几磅体重。
A little over a year ago, a friend of mine named Tim decided to stop consuming sugar altogether to help him shed a few pesky pounds and regain more energy. The first week was tough; he felt lethargic and craved anything sweet, but by the end of the third week, the sugar withdrawal had abated and he began to feel peppy again. Shortly afterward, the benefits of the new diet began to creep in: he felt more alert throughout the workday, and he lost a few pounds.
此后,坚持节食成为他面临的最大挑战。蒂姆曾看到朋友们尝试节食但都以失败告终,因此他知道自己需要为自己设定现实的期望。为了消除诱惑,他禁止在公寓里吃任何甜食。他定期记饮食日记,以监督自己,但与朋友见面时偶尔会吃点零食。在戒糖之旅开始两个月后,他的伴侣也加入了他的行列,他们在一起能够更好地互相支持和鼓励。如今,蒂姆的健康状况好多了,他的精力水平只有他的小狗可以与之媲美。
After that, sticking to the diet was his biggest challenge. Tim had seen his friends try and fail to stick to a diet, so he knew that he needed to set realistic expectations for himself. To eliminate the temptation, he banished any sweet food from his apartment. He kept a regular food journal to keep himself accountable, but allowed himself the occasional treat when meeting up with friends. Two months into his journey, his partner joined him on the sugar-free journey, and together they were able to better support and encourage one another. Today, Tim is in much better health and his energy levels are only rivaled by his puppy.
重构有点像开始新的饮食并坚持下去。虽然看起来最大的挑战似乎是找出要进行的改变并实施它,但同样需要付出巨大的努力才能确保改变持久。在本章中,我们将介绍可以采用的各种工具和实践,以确保我们通过大规模重构所做的改进尽可能持久。我们将研究如何鼓励整个组织的工程师接受重构建立的模式,以及如何使用持续集成来继续促进他们的采用。我们将讨论通过进行重构后路演来教育同事工程师的重要性。最后,我们将讨论如何将渐进式改进融入工程文化,以便希望在不久的将来需要更少的大规模重构。
Refactoring is a bit like taking up a new diet and sticking with it. Although it might seem like the greatest challenge is figuring out the change to make and implementing it, equally significant effort is required to ensure that the change lasts. In this chapter, we’ll look at a variety of tools and practices we can adopt to ensure that the improvements we made with our at-scale refactor are as long-lasting as possible. We’ll examine how to encourage engineers across the organization to embrace the patterns established by the refactor and how to use continuous integration to continue to promote their adoption. We’ll talk about the importance of educating fellow engineers by doing a post-refactor roadshow. Finally, we’ll touch on how to integrate incremental improvement into the engineering culture so that, hopefully, fewer large, at-scale refactors are needed in your near future.
很多时候,大量的工程师需要与你的重构进行交互。你需要这些工程师对重构及其建立的模式的支持,原因有二。
Quite often, a large number of engineers will need to interact with your refactor. You need these engineers’ support for the refactor and the patterns it established for two reasons.
首先是要确保它引入的更改能够长期持续下去。大规模重构可能会引起两极分化;通常,在任何一家由多名员工组成的公司中,都会有狂热的支持者和反对者。如果设计的反对者拒绝按照新的设计/模式编写新代码,他们会想方设法避免这样做,并在您的团队所做的更改和他们自己的代码之间的边界上产生新的垃圾。最终,这种积累可能会使重构的几乎所有好处都变得毫无意义。
The first is to ensure that the changes it introduced persist long-term. Expansive refactors can be polarizing; frequently, within any company of more than just a few individuals, there are both avid supporters and opponents of the chosen design. If the opponents of the design refuse to write new code following the new design/patterns, they’ll find ways to avoid doing so and generate new cruft at the boundary between the changes made by your team and their own code. Ultimately, this build-up could render nearly all the benefits of the refactor meaningless.
即使您计划并执行了高质量的重构,也并非每个人都会理解或同意您的愿景。对于工程团队的新人来说,重构试图解决的问题可能并不十分清楚。当其他工程师没有必要的背景知识来正确理解重构的结果时,他们可能会在重构的边缘工作时遇到困难。他们冒着错误实施重构引入的新模式的风险,或者在代码会从中受益匪浅的情况下完全不使用它们。
Even if you plan and execute a quality refactor, not everyone will understand or agree with your vision. For newcomers to the engineering team, the problems the refactor attempted to solve may not be abundantly clear. When fellow engineers do not have the necessary context to properly appreciate the outcome of a refactor, they may struggle when working at its perimeter. They risk incorrectly implementing the new patterns it introduces, or failing to use them at all in situations where the code would greatly benefit from them.
您需要工程师支持的第二个原因是,让重构所建立的模式能够进一步渗透到整个代码库中。您不仅希望所引入的更改能够保留下来,还希望它们能够为未来几个月甚至几年内从事代码库工作的工程师做出的未来决策提供参考。考虑一个简单的类比:重构就像给一片杂草丛生的菜园除草、翻土并种上几棵葱。维护葱是我们的首要目标,鼓励我们的家庭成员在新补充的土壤中种植其他蔬菜是我们的次要目标。
The second reason you need engineers’ support is to enable the further permeation of the patterns established by the refactor throughout the codebase. You not only want the changes you introduced to remain, you also want them to inform future decisions made by engineers working in the codebase for months, perhaps years to come. Consider a simple analogy: a refactor is just like weeding an overrun vegetable garden, turning over the soil, and planting a few scallions. Maintaining the scallions would be our first goal, and encouraging our family members to plant other vegetables of their own into the newly replenished soil would be our secondary goal.
例如,一个团队重构了其庞大代码库中使用的主要日志库,在工程师意外将个人身份信息 (PII) 泄露到数据处理管道中后,该团队重写了该库的主要接口以拒绝任意字符串。如果开发人员想要记录新字段或创建新的日志类型,他们现在必须在日志库中注册它,然后相应地使用它。该团队决定缩小重构范围,只修改现有库的逻辑以调用新库,而不是替换现有日志库中的每个调用点。
For example, after more than a few mishaps in which engineers accidentally leaked personally identifiable information (PII) into their data processing pipelines, a team refactored the primary logging library used throughout its extensive codebase, rewriting the library’s primary interface to refuse arbitrary strings. If developers wanted to log a new field or create a new log type, they now had to register it in the logging library and then use it accordingly. Instead of replacing each individual callsite of the existing logging library, the team decided to scope down the refactor and simply modify the logic of the existing library to call into the new one.
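A registry-based interface of this kind could be sketched as follows. This is not the team's actual library: every name here (`register_log_type`, `log_event_type`, `UnregisteredLogError`) is hypothetical, and the rule that free-form string fields cannot be registered at all is an assumption made to keep the example small.

```python
class UnregisteredLogError(ValueError):
    """Raised when a log type or field has not been registered (hypothetical)."""


# Maps log type name -> {field name: expected Python type}.
_REGISTRY = {}


def register_log_type(name: str, fields: dict) -> None:
    """Declare a log type and the typed fields it may carry."""
    # Assumption for this sketch: free-form strings are never allowed,
    # because they are the easiest way to leak PII by accident.
    if str in fields.values():
        raise ValueError("free-form string fields are not allowed")
    _REGISTRY[name] = dict(fields)


def log_event_type(name: str, **values) -> dict:
    """Build a log record, refusing anything that was not registered."""
    if name not in _REGISTRY:
        raise UnregisteredLogError(f"unknown log type: {name!r}")
    schema = _REGISTRY[name]
    for field, value in values.items():
        if field not in schema:
            raise UnregisteredLogError(
                f"{name!r} has no registered field {field!r}"
            )
        if not isinstance(value, schema[field]):
            raise TypeError(f"{field!r} expects {schema[field].__name__}")
    # A real library would hand the record to a logging backend here.
    return {"type": name, **values}
```

With this shape, logging a new field is a two-step process: register it once, then use it. That is the extra friction the refactor introduces on purpose.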
公司的一些工程师不愿失去记录任意字符串所带来的灵活性。来自以前拥有更灵活日志记录的公司工程师可能也不明白为什么新的日志记录框架会故意引入这些限制。如果不正确地向这些工程师传达您的动机,并与他们合作解决他们的 挫败感,您就有可能让他们找到创造性的方法来绕过新日志记录库中内置的安全措施,从而进一步增加 PII 再次泄露到您的数据处理管道中的风险。
Some engineers at the company were reluctant to lose the flexibility that comes with being able to log arbitrary strings. Engineers coming from previous companies with more flexible logging might also be confused about why a new logging framework would purposefully introduce these limitations. Without properly communicating your motivations to these engineers, and working with them to address their frustrations, you risk them finding inventive ways of working around the safeguards built into the new logging library, thus further increasing the risk that PII will be leaked into your data processing pipelines once more.
即使工程师接受重构带来的变化,他们也可能不赞成主动将现有调用站点转换为直接使用新库。他们可能也不愿意向新库添加新的日志字段和类型,而是选择将现有字段和类型用于更广泛的日志,从而降低了它们的特殊性。通过让扩展日志库变得极其容易,然后教工程师如何做到这一点,您将简化他们的过渡,并有望提高整个代码库中新库的整体使用率。
Even if engineers accept the changes brought about by the refactor, they may not be in favor of actively converting existing callsites to use the new library directly. They may also be apathetic about adding new log fields and types to the new library, choosing instead to use existing fields and types for a broader range of logs, thereby diminishing their specificity. By making it extremely easy to extend the logging library, and then teaching engineers how to do so, you’ll ease their transition and, hopefully, increase overall usage of the new library throughout the codebase.
虽然我们可以通过多种方式鼓励整个工程组织采用重构,但根据我的经验,以下方法是最有效的。第一种方法是为工程师构建符合人体工程学的界面,以便在与新重构的代码交互时使用。这些接口应该在项目执行的早期定义,并在整个开发过程中进一步完善。您应该从您的队友和整个工程组织中值得信赖的同事那里收集反馈,了解如何使重构与代码库其余部分之间的界限更加符合人体工程学。如果您已经完成重构,但尚未与未来用户充分审查您的界面,请与来自不同产品领域的几位工程师一起建立一个研讨会,并与他们一起迭代界面。
While there are a number of ways we can encourage adoption of the refactor across our engineering organizations, the following methods are the ones that work best in my experience. The first is to build ergonomic interfaces for engineers to use when interacting with the newly refactored code. These interfaces should be defined early in the project’s execution and be further refined throughout development. You should be gathering feedback from both your teammates and trusted peers across the engineering organization on how the boundary between the refactor and the remainder of the codebase could be made more ergonomic. If you’ve wrapped up the refactor and haven’t sufficiently vetted your interfaces with their future users, set up a workshop with a few engineers from distinct product areas and work with them to iterate on the interfaces.
本章将详细介绍重构后最有效的方法。这些方法包括使用您编写的文档向工程师传授重构知识,最后,仔细强化重构引入的任何新模式的使用,以鼓励持续采用。
The methods we’ll look at more closely in this chapter are most effective post-refactor. These include teaching engineers about the refactor using the documentation you’ve crafted, and finally, carefully reinforcing usage of any new patterns introduced by the refactor to encourage continued adoption.
有两种主要方法可以向其他人传授您的重构知识。第一种是主动的;这包括规划和领导研讨会或类似的培训,以积极吸引工程师。第二种是被动的;这包括工程师可以自行完成的分步教程,或通过公司的学习平台提供的简短在线课程。
There are two primary methods of educating others about your refactor. The first is active; this includes planning and leading workshops or similar training to engage actively with engineers. The second is passive; this includes step-by-step tutorials engineers can walk through on their own, or short online courses through your company’s learning platform.
当重构影响到代码库中其他团队工程师经常使用的关键部分时,积极的教育环节就显得尤为重要。习惯于现有模式的工程师需要熟悉全新的做事方式。
An active educational component is most important when the refactor affects a critical portion of the codebase that is used frequently by other engineers from a range of teams. Engineers who are accustomed to an existing set of patterns will need to familiarize themselves with a whole new way of doing things.
确保工程师能够有效使用重构代码的最佳方法之一是与他们一起参加论坛,要求他们通过代码示例进行交互,并在学习如何与重构交互时提出问题。举办研讨会的一大优势是,它鼓励忙碌的工程师特意留出时间来跟上进度;我们中的一些人参与了如此多不同的任务,否则我们永远无法优先考虑了解重构。
One of the best ways to ensure that engineers can work effectively with the refactored code is to engage with them in a forum that requires them to work interactively through code samples and ask questions as they learn how to interface with the refactor. A significant advantage of holding workshops is that it encourages busy engineers to deliberately set aside time to get up to speed; some of us are involved in so many different tasks that we would otherwise never manage to prioritize informing ourselves about the refactor.
积极培训工程师如何与重构交互的时间是在重构刚刚完成时。当存在可能仍处于变动状态或尚未完全清理并准备好供不熟悉重构细节的个人使用的风险时,您不希望工程师进来学习新的代码和模式。在安排您的第一次研讨会之前,请花时间验证一切是否井然有序。更好的做法是,在向同事开放之前,与您的团队一起进行一次研讨会的试运行,以消除任何问题。
The time to educate engineers actively about how to interface with the refactor is once it’s been newly completed. You don’t want engineers coming in to learn new code and patterns when there’s a risk that it might still be in flux or it hasn’t yet been fully cleaned up and prepared for use by individuals who aren’t intimately familiar with the details of the refactor. Take the time to verify that everything is in order before scheduling your first workshop. Better yet, do a dry run of the workshop with your team to iron out any kinks before opening it up to your peers.
这些会议不应该永久举行。理想情况下,在几个月内,大多数受重构影响最大的工程师应该会熟悉它。到那时,重构的代码将成为新常态,对理解它的帮助需求应该会大幅减少。考虑只举办两到三次研讨会,并关注兴趣水平和随后的出席情况。现场培训虽然可能很有吸引力,但会占用您的团队大量时间,因此应该只举办几次。如果在几次会议之后需求仍然存在,您可能需要投资改进文档并更多地依赖它。
These sessions shouldn’t be held in perpetuity. Ideally, within a few months, most of the engineers most significantly affected by the refactor should be well acquainted with it. At that point, the refactored code becomes the new normal, and demand for help understanding it should dramatically decrease. Consider holding just two or three workshops, and keep an eye on the interest level and subsequent attendance. Live trainings, as engaging as they might be, are incredibly time consuming for your team and should be held only a handful of times. If demand continues after more than just a handful of sessions, you may want to invest in improving your documentation and leaning on it more heavily.
实际上,由于几乎每个工程师都在其常规工作流程中使用日志记录,因此我们之前的示例非常适合用于培训课程。以下是其结构:
In practice, because just about every engineer uses logging in their regular workflow, our previous example would be a perfect candidate for a training session. Here’s how it could be structured:
快速概述重构的目标。为了有效地传达重构的影响并激励您的同事利用重构,请通过最引人注目的示例进行讨论。例如,对于日志库,您可以展示一些导致过去几个月泄露 PII 的误导性日志语句;然后,演示如何使用新的日志库来完全防止这些信息被泄露。
Give a quick overview of the goals of the refactor. To communicate its impact effectively and excite your coworkers to take advantage of it, talk through the most compelling examples. With the logging library, for instance, you might show a few misleading log statements responsible for leaking PII over the past few months; then, demonstrate how to use the new logging library to prevent this information from being leaked altogether.
接下来,为了巩固这些概念,将与会者分成两组,并要求他们迁移相同的简单日志语句以使用新库。回答他们提出的任何问题。这里可能存在多个解决方案;如果有,请让每组解释他们不同的解决方案。
Next, to cement these concepts, pair up the attendees and ask them to migrate the same simple log statement to use the new library. Answer any questions as they arise. There may be more than one solution here; if there is, have the pairs explain their distinct solutions.
最后,让每组选择一个更复杂的日志语句进行迁移,最好是需要扩展日志库的语句(通过添加新的日志类型或字段类型)。与每个小组核对并回答他们可能遇到的任何问题。
Finally, have the pairs choose a more complex log statement to migrate, ideally one that requires extending the log library (by either adding a new log type or field type). Check in with each group and answer any questions they might have.
办公时间也可以成为积极教育同事的同样有用的论坛。它们为工程师提供了一个开放的机会,可以顺便拜访你和你的团队,向你和你的团队询问有关重构及其在特定用例中的采用的问题。并不是每个与你的重构互动的人都有时间(或兴趣)参加研讨会;在办公时间,他们可以得到你团队的全神贯注,这将使他们更有可能在采用重构实施的更改时获得积极的体验。此外,以前的研讨会参与者可以顺便过来并在必要时获得额外的指导。
Office hours can be an equally helpful forum for actively educating your colleagues. They give engineers an open opportunity to drop by and ask you and your team questions about the refactor and its adoption in their specific use cases. Not everyone who will interact with your refactor will have time (or interest) to attend a workshop; having office hours when they can have your team’s undivided attention will make them more likely to have a positive experience adopting the changes implemented by the refactor. Furthermore, previous workshop attendees can drop by and get additional guidance if necessary.
安排办公时间的优点之一是,它使您的团队能够限制回答与重构有关的问题所花费的时间。重构结束后,您的团队可能会开始收到来自公司各地同事的大量请求。如果您不合理安排时间,这些问题很容易会占据您的注意力(更不用说频繁切换上下文会打乱您的一天)。通过将所有非紧急请求转移到您的办公时间,您可以保护团队的时间和注意力。
One of the advantages of hosting office hours is that it enables your team to time-box the amount of time they spend answering questions pertaining to the refactor. Your team may start to get bombarded with requests from colleagues across the company as soon as the refactor wraps up. If you aren’t judicious with your time, these questions could easily monopolize your attention (not to mention disrupt your day with frequent context-switching). By diverting all nonurgent requests to your office hours, you are protecting your team’s time and focus.
记录您的团队在办公时间内处理的问题和疑虑,并利用这些问题和疑虑编写常见问题解答。此文档将帮助您的团队节省宝贵的时间,无需在办公时间以外重复回答相同的问题。
Keep track of the questions and concerns your team addresses during these office hours and use these to write an FAQ. This document will help save your team valuable time repeatedly answering the same questions both during office hours and beyond.
许多工程团队定期举办公开论坛(例如,周四下午的酒会和演示,或每两周一次的午餐和学习),工程师们可以在论坛上介绍他们所领导的工作。大型重构项目通常会带来许多有趣的故事:团队发现的令人难以置信的、负载过大的错误、15 年前最后一次修改代码时的可怕遭遇、部署出错。我们大多数人都真心喜欢听别人讲述我们在共享代码方面的经历,而且我们往往会清楚地记得特别好的故事。
Many engineering groups host regular open forums (e.g., Thursday afternoon Drinks and Demos, or bi-weekly Lunch and Learn) where engineers can present about the work they’re spearheading. Large refactoring projects often come with a number of interesting stories: the mind-boggling, load-bearing bug the team uncovered, the terrifying encounter with code last modified 15 years ago, the deploy gone wrong. Most of us genuinely enjoy hearing one another’s stories about our experiences in the code we share, and we tend to vividly remember the particularly good ones.
报名与同事进行简短的演讲,介绍重构中引人注目的部分,让他们了解该项目,并好奇地了解其动机以及他们如何在代码库领域从中受益。有时,一点点精彩的故事讲述就是您获得同事支持所需的全部宣传。
Sign up to give a short talk to your peers about a compelling portion of the refactor to make them aware of the project and curious to learn more about its motivations and how they might benefit from it in their areas of the codebase. Sometimes, a little bit of great storytelling is all the publicity you need to garner the support of your fellow engineers.
在第 7 章中,我们讨论了文档的重要性:不仅要在整个重构过程中制作详尽的文档,还要选择适合您团队的媒介和组织方案。一旦您进入重构的最后阶段,您的团队应优先编写文档,描述重构的目的以及重构如何使在同一代码库中工作的其他工程师受益。根据我们在第 7 章中的讨论,您或您的团队制作的任何文档都应添加到您的真实来源目录中。
In Chapter 7, we discussed the importance of documentation: not only the importance of producing thorough documentation throughout the refactoring process, but also the importance of choosing a medium and organization scheme that works well for your team. Once you’ve reached the final stages of the refactor, your team should prioritize crafting documentation describing the intent of the refactor and how it can benefit fellow engineers working within the same codebase. Per our discussion in Chapter 7, any documentation you or your team produces should be added to your source-of-truth directory.
此文档可以采用多种形式:它可以是常见问题解答、提供项目目标高级摘要的简短自述文件或教程。拥有可以向好奇的工程师指出的文档可以帮助您的团队节省回答问题的时间。如前文“办公时间”中所述,您的团队可能需要回答公司内同事提出的大量问题。您的团队可以向他们指出准备好的文档,而不是逐一回答每个人的问题。
This documentation can take a number of forms: it can be an FAQ, a short README providing a high-level summary of the project’s goals, or a tutorial. Having documentation you can point curious engineers to helps your team save time answering questions. As previously mentioned in “Office hours”, your team will likely need to answer a multitude of questions from peers throughout the company. Instead of answering everyone individually, your team can instead point them to prepared documentation.
如果您打算编写一份关于重构后代码库导航的操作指南,我建议从历史角度编写;也就是说,将其置于重构的故事中,从一开始到世界的现状结束。通过从这样的角度讨论重构,您可以防止文档立即过时。尽可能添加日期以向读者提供适当的背景信息(即使像一年这样宽泛的时间也可以)。让我们使用日志示例来说明这一点。
If you intend to write a how-to guide on navigating the codebase post-refactor, I recommend writing it from a historical perspective; that is, ground it in the story of the refactor, starting from the very beginning and concluding with the current state of the world. By discussing the refactor from such a perspective, you can prevent your documentation from immediately becoming outdated. Whenever possible, add dates to give readers appropriate context (even something as broad as a year may suffice). Let’s illustrate this, using our logging example.
首先,向读者介绍您和您的团队在寻求改进代码之前花时间了解代码性能下降的原因(参见第 2 章)。以我们的日志库为例,首先概述初始设计以及设计时做出的决策。谈谈作者希望该库轻量且易于使用,并允许任何人(谨慎地)方便地记录几乎任何内容。
Start by giving readers the insight that you and your team acquired by spending the time understanding why the code had degraded before you sought to improve it (see Chapter 2). In the case of our logging library, begin by giving an overview of the initial design and the decisions that informed that design. Talk about how the authors wanted the library to be lightweight and easy to use, and allow anyone to (carefully) log just about anything conveniently.
讨论随着产品变得越来越复杂,越来越多的工程师加入团队,泄露 PII 的风险也随之增加。列出最近发生的严重泄露事件,并表明最近几个月泄露的频率不断增加。
Discuss how, as the product became more complex and more engineers joined the team, the risk of leaking PII increased. List recent, serious instances when leaks occurred, demonstrating a growing frequency in recent months.
描述您的解决方案以及它如何防止 PII 泄露。使用旧日志库和新日志库比较和对比相同的日志语句。尽量避免使用“现在”、“当前”或“今天”等词语。虽然您可能从自己的角度概述了代码当前的运行方式,但代码很有可能会继续发展。通过在解释前加上“截至 2020 年 9 月”之类的字眼,而不是“今天”,您就可以为文档做好面向未来的准备。
Describe your solution and how it inhibits PII from being leaked. Compare and contrast the same log statement, using both the previous and new logging libraries. Try to avoid using words like “now,” “currently,” or “today.” Although you may be outlining how the code presently functions from your perspective, there is a strong chance that the code will continue to evolve. By prefacing your explanations with something like “as of September 2020,” instead of “today,” you are future-proofing your documentation.
积极鼓励是一种强大的工具。无论与项目的距离有多远,整个公司的开发人员都需要被提醒它所建立的模式(可能不止一次)。在这里,我们有两个更广泛的选择。您可以采用我们在“激励个人”中描述的许多激励策略来表彰那些在采用重构中建立的模式方面做得非常出色的工程师。看到您的同事因其贡献而受到公开赞扬,可以导致开发人员的采用率迅速提高。
Positive reinforcement is a powerful tool. Regardless of proximity to the project, developers across the company will need to be reminded of the patterns established by it (and probably more than once). Here, we have two broader options. You can employ many of the motivational tactics we described in “Motivating individuals” to recognize engineers who are doing a great job of adopting the patterns established in your refactor. Seeing your coworkers being publicly praised for their contributions can lead to a rapid increase in adoption by developers far and wide.
第二种选择是使用持续集成自动强化开发流程。通过持续集成,当作者推送新的提交、表明其代码已准备好进行审查或准备将其更改与主要开发分支合并时,我们可以启动许多流程。典型的设置将通过运行一系列测试以及 lint 和代码分析工具来验证更改。我们将研究 lint 和代码分析器,然后考虑如何配置这些工具,以有效地让您的团队无需积极鼓励和监控采用情况。
A second option is to automate reinforcement in the development process with continuous integration. With continuous integration, we can kick off a number of processes when an author pushes a new commit, indicates that their code is ready for review, or prepares to merge their changes with the main development branch. A typical setup will verify changes by running a series of tests alongside lints and code analysis tools. We’ll look at both linting and code analyzers and then consider the ways in which you can configure these tools to effectively free your team from needing to actively encourage and monitor adoption.
渐进式 linting 允许您通过仅对新编写或修改的代码强制执行规则来逐步改进代码库。这使开发人员能够在问题出现时慢慢解决问题,而不是要求一两个工程师修补每个违反规则的情况。如果您的团队正在用一种模式替换另一种模式,那么编写新的(渐进式)linter 规则是一种简单的方法,可以促使开发人员使用较新的模式并防止已弃用模式的传播。
Progressive linting allows you to improve a codebase gradually by only enforcing rules on newly written or modified code. This enables developers to address problems slowly as they arise rather than requiring one or two engineers to patch every instance where the rule would be violated. If your team is replacing one pattern with another, writing a new (progressive) linter rule is an easy way to nudge developers to use the newer pattern and prevent propagation of the deprecated pattern.
例如,作为日志库重构的一部分,您的团队希望消除对 logEvent 的引用(该函数允许提取任意字符串),转而使用 logEventType,它仅记录特定的非 PII 数据。您的团队可以编写一条新的 linter 规则,禁止对 logEvent 的任何新使用,并发出一条错误消息,通知工程师该函数已被弃用并鼓励他们改用 logEventType。
For example, as part of the logging library refactor, your team wants to eradicate references to logEvent, which allows for arbitrary strings to be ingested, in favor of logEventType, which only logs specific, non-PII pieces of data. Your team could write a new linter rule that bans any new usage of logEvent, with an error message informing engineers that the function is deprecated and encouraging them to use logEventType instead.
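As a rough illustration, such a rule could be as simple as a text scan. The sketch below is written in Python and is not tied to any particular linter; the pattern, the wording of the message, and the `find_violations` helper are all assumptions. A real rule would more likely be written against your linter's own plugin interface and would operate on a syntax tree rather than on raw text.

```python
import re

# Matches calls to logEvent(...) but not logEventType(...): the name must be
# followed directly by optional whitespace and an opening parenthesis.
DEPRECATED_CALL = re.compile(r"\blogEvent\s*\(")

# Put the full context in the message so the author can fix the problem
# without hunting for additional documentation.
MESSAGE = (
    "logEvent is deprecated because it accepts arbitrary strings, which can "
    "leak PII. Use logEventType with a registered log type instead."
)


def find_violations(source: str) -> list:
    """Return (line number, message) pairs for each deprecated call.

    Being purely textual, this also flags matches inside comments and
    string literals; that trade-off is accepted for the sake of brevity.
    """
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if DEPRECATED_CALL.search(line):
            violations.append((lineno, MESSAGE))
    return violations
```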
有些工程师对遇到意外的 linter 故障非常敏感。一定要充分传达新 linter 规则的目标以及生效时间,以免任何人感到意外。在错误消息中添加尽可能多的上下文,以便遇到错误的工程师无需查找任何其他文档来修复它。
Some engineers are very sensitive about encountering unexpected linter failures. Be certain to adequately communicate the goal of the new linter rule and when it will come into effect so that no one is surprised. Add as much context to the error message as possible so that engineers hitting the error don’t need to pull up any additional documentation to fix it.
并非所有语言都具有允许开发人员编写自定义规则的可扩展 linter,甚至很少有语言内置渐进式 linting 功能。一些工程团队投资于内部构建这些工具(在某些情况下,后来开源他们的解决方案)。如果您正在使用可扩展 linter,并且能够编写自定义规则,那么引入渐进式 linting 的快速方法是仅对给定提交中的修改文件或仅对代码差异本身运行 linter。
Not all languages have extensible linters that allow for developers to write custom rules, and even fewer have progressive linting capabilities built in. Some engineering teams invest in building these tools internally (and, in some cases, later open-source their solutions). If you are using an extensible linter, and are able to write custom rules, a quick way to introduce progressive linting is by running the linter either only on modified files in a given commit or only on the code difference itself.
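One way to approximate progressive linting over the code difference itself is to apply a rule only to the lines a change adds. The sketch below assumes the change is available as unified diff text (for example, the output of `git diff`) and reuses the hypothetical ban on `logEvent` from the running example; `added_lines` and `lint_diff` are illustrative names, not part of any existing tool.

```python
import re

# Hunk headers look like "@@ -12,7 +12,9 @@"; capture the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")
DEPRECATED_CALL = re.compile(r"\blogEvent\s*\(")


def added_lines(diff_text: str) -> list:
    """Return (path, new line number, text) for every line the diff adds.

    Simplification: file headers are recognized by their "--- " and "+++ "
    prefixes, so a changed line whose content starts with "-- " or "++ "
    would be misread. A production parser should track hunk lengths.
    """
    results = []
    path = None
    new_lineno = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:]
            if path.startswith("b/"):
                path = path[2:]
            continue
        if line.startswith("--- "):
            continue
        header = HUNK_HEADER.match(line)
        if header:
            new_lineno = int(header.group(1))
            continue
        if path is None:
            continue
        if line.startswith("+"):
            results.append((path, new_lineno, line[1:]))
            new_lineno += 1
        elif line.startswith("-"):
            continue  # removed lines do not exist in the new file
        else:
            new_lineno += 1  # unchanged context line
    return results


def lint_diff(diff_text: str) -> list:
    """Flag deprecated calls only where the change introduces them."""
    return [
        (path, lineno)
        for path, lineno, text in added_lines(diff_text)
        if DEPRECATED_CALL.search(text)
    ]
```

Existing calls that appear only as unchanged context are left alone, which is what lets the rule take effect immediately without requiring anyone to patch every legacy callsite first.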
第 3 章中介绍的许多指标都可以使用集成时触发的现成代码分析工具随时间进行监控。有各种免费和付费的开源解决方案可以自动计算不同规模(单个函数、类、文件等)的代码复杂度并生成测试覆盖率统计数据。其中许多解决方案都易于扩展,因此您的团队可以开发和加入自己的指标计算并随着时间的推移断言新规则。
Many of the metrics covered in Chapter 3 can be monitored over time, using out-of-the-box code analysis tools triggered at integration time. There is a wide range of both free and paid open-source solutions that will automatically calculate code complexity at different scales (individual functions, classes, files, etc.) and generate test coverage statistics. Many of these solutions are easily extendable so that your team can develop and hook in its own metrics calculations and assert new rules as time goes on.
例如,假设您的团队希望确保代码库中的任何函数不超过 500 行。您的团队可以配置所选的代码分析工具,以便在更改导致函数超过该阈值时发出警告或抛出错误。如果工程师在现有函数中添加了几行代码,将其行数从 490 增加到 512,他们会被要求在合并更改之前将该函数拆分为较小的子函数。
For example, say your team wants to ensure that no function in the codebase exceeds 500 lines. Your team could configure your chosen code analysis tool to warn or throw an error whenever a change causes a function to cross that threshold. If an engineer comes along and adds a few lines to an existing function, increasing its line count from 490 to 512, they’d be nudged to split up the function into smaller subfunctions before merging their changes.
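A check like this can be sketched with Python's standard `ast` module, assuming for illustration that the code under analysis is itself Python. The helper names are hypothetical, and an off-the-shelf analysis tool would normally supply this for you. The check compares the file before and after a change and reports only functions that cross the threshold, mirroring the 490-to-512 scenario above.

```python
import ast

MAX_FUNCTION_LINES = 500  # the threshold from the example above


def function_lengths(source: str) -> dict:
    """Map each function name to the number of lines it spans.

    Simplification: functions are keyed by bare name, so two methods with
    the same name in different classes would overwrite one another.
    """
    lengths = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lengths[node.name] = node.end_lineno - node.lineno + 1
    return lengths


def newly_oversized(before: str, after: str, limit: int = MAX_FUNCTION_LINES) -> list:
    """Names of functions that the change pushed over the limit."""
    old = function_lengths(before)
    new = function_lengths(after)
    return sorted(
        name
        for name, length in new.items()
        if length > limit and old.get(name, 0) <= limit
    )
```

Whether the result is reported as a warning or as an error is a separate configuration decision.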
我们的集成流程中配置的每个验证步骤都可以是一道大门,阻止更改继续向前推进,也可以是一道护栏,在继续之前向代码作者发出警告。
Each verification step configured in our integration flow can either be a gate, preventing the changes from continuing to move forward, or a guardrail, producing a warning for the code author to consider before proceeding.
过多的门限对工程组织来说可能是有害的:它们会减慢开发速度,并会让工程师感到沮丧(尤其是在意外的情况下)。假设您的组织配置了 10 个阻塞测试套件。当开发人员准备将他们的代码提交审查时,测试套件会并行启动。不幸的是,这些套件中大约有一半需要 10 分钟以上的时间才能运行,其中一些套件经常产生不稳定的结果。工程师们正在花费宝贵的时间等待他们的代码清除这 10 个门限中的每一个。
Too many gates can be detrimental to an engineering organization: they slow down development and can frustrate engineers (especially if they are unexpected). Say your organization has configured 10 blocking test suites. When a developer is ready to put their code up for review, the test suites kick off in parallel. Unfortunately, about half of these suites take just over 10 minutes to run, and a few of them regularly produce flaky results. Engineers are spending valuable time waiting for their code to clear each of these 10 gates.
现在假设组织不再设置门槛,而是设立护栏;也就是说,团队不再让每个测试套件阻碍进度,而是决定哪两个或三个是真正对业务至关重要的预合并套件,并将其他套件标记为可选套件。工程师现在负责确定他们认为哪些套件对他们的变更最重要,如果结果不稳定,他们可以选择忽略它们。当然,选择更多护栏也有其自身的风险,也许更多的错误可能会进入生产环境,但总的来说,我认为我们应该更加信任我们的工程师同事。
Now suppose that instead of setting up gates, the organization institutes guardrails; that is, instead of having each of these test suites block progress, the team decides which two or three are truly business-critical premerge, and labels the others as optional. Engineers are now responsible for determining which suites they believe to be most important to their changes, and if the results are flaky, they can choose to ignore them. Of course, opting for more guardrails comes with its own risks, and perhaps more bugs may make it out into production, but by and large, I’m of the opinion that we should be trusting our fellow engineers more.
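The distinction can be expressed as a single flag per verification step. The following sketch is purely illustrative: `SuiteResult`, `evaluate`, and the suite names in the usage example are hypothetical, and real continuous integration systems expose this choice through their own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    name: str
    passed: bool
    blocking: bool  # True = gate, False = guardrail


def evaluate(results: list) -> tuple:
    """Return (can_merge, gate_failures, guardrail_warnings).

    Only failing gates stop the change; failing guardrails are surfaced
    to the author, who decides whether they matter for this change.
    """
    gate_failures = [r.name for r in results if not r.passed and r.blocking]
    warnings = [r.name for r in results if not r.passed and not r.blocking]
    return (not gate_failures, gate_failures, warnings)
```

For instance, a failing optional suite produces a warning but does not stop the merge:

```python
results = [
    SuiteResult("payments", passed=True, blocking=True),
    SuiteResult("ui_snapshots", passed=False, blocking=False),
]
can_merge, failures, warnings = evaluate(results)
# can_merge is True; warnings lists the failing optional suite
```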
只要我们无法预测技术或需求的变化将如何继续影响我们的系统,就永远需要大规模重构。然而,我确实相信一些大规模重构是可以避免的,我们应该尽最大努力尽可能地防止它们。在我们结束本章时,我想给你留下一些关于如何建立持续改进文化的想法。通过不断确定和利用切实改进代码的机会,我们希望能够更长时间地避免雄心勃勃、破坏性的重构。
There will always be a need for large-scale refactors, as long as none of us can predict how shifts in technologies or requirements will continue to affect our systems. However, I do believe that some large-scale refactors are avoidable, and that we should do our best to prevent them when possible. As we conclude this chapter, I want to leave you with some thoughts on how to build a culture of continuous improvement. By perpetually pinpointing and taking advantage of opportunities for tangibly improving our code, we can hopefully ward off ambitious, disruptive refactors for a while longer.
首先,维护健康代码库的最佳方法之一就是在有机会时继续有意识地重构小的、包含良好的代码部分。我们不想成为路过的重构者(参见“因为你碰巧路过”),而是专注于逐步改进我们自己团队拥有和维护的代码库领域。我们总是有很多机会整理自己的社区。当我们遇到另一个团队改进代码的机会时,我们可以伸出援手,倾向于提出问题以更好地了解他们的问题,而不是立即提出解决方案。共同努力制定更清晰的 实现。
First and foremost, one of the best ways to maintain a healthy codebase is simply to continue deliberately refactoring small, well-contained portions of code as you encounter the opportunity. We do not want to become drive-by refactorers (see “Because You Happened to Be Passing By”), but instead focus on incrementally improving areas of the codebase owned and maintained by our own team. There are always plenty of opportunities for us to tidy up in our own neighborhood. When we encounter an opportunity for another team to improve their code, we can reach out, leaning toward asking questions to understand their problems better, rather than immediately proposing a solution. Work together to craft a cleaner implementation.
我们应该经常鼓励和促进团队进行设计对话,尽早寻求他人的反馈,而不是独自推进。代码审查不仅是别人检查我们工作的机会,也是公开讨论如何使我们的解决方案更好的机会。作为代码作者,我们应该考虑在代码审查中为审阅者提供具体问题。作为审阅者,我们在审阅同事的代码时应该像自己编写代码时一样具有分析能力。
We should encourage and facilitate design conversations on our team frequently, seeking others’ feedback early rather than forging ahead on our own. Code reviews are not only an opportunity for someone to double-check our work, but also a chance for an open discussion about how we can make our solution just that much better. As code authors, we should consider annotating our code reviews with specific questions for our reviewers. As reviewers, we should be just as analytical when reviewing our peers’ code as we are when we are writing code ourselves.
最后,在功能开发过程的早期进行包容性设计评审。这意味着邀请来自各个背景的工程师来评估您的设计并提出问题。您的评审人员应该涵盖所有经验和资历水平;他们应该包括来自各种背景的个人。您能够收集的观点越多样化,您就越有可能尽早发现致命缺陷,最终,您就越有可能设计出更优越的解决方案。
Finally, hold inclusive design reviews early in the feature development process. This means inviting engineers from all backgrounds to evaluate your designs and ask questions. Your reviewers should span all experience and seniority levels; they should include individuals from a wide range of backgrounds. The more diverse perspectives you are able to gather, the more likely you’ll be able to spot fatal flaws early and, ultimately, the more likely you’ll be able to architect a far superior solution.
Whenever you next sit down to work, think critically about how what you do today might or might not lead to a large-scale refactor later. Sometimes, all we need is a little reminder of the potential long-term consequences of our decisions to steer us back in the right direction.
Before I dive into our case studies, let me set the stage by telling you a little bit about Slack: the history of the product, the company, and its early influences.
Slack was developed as an internal tool at a small gaming company based out of Vancouver called Tiny Speck. The team, a mash-up of engineers, designers, and product people from Flickr, sought to build a fantastical, massively multiplayer online game focused on community building. They called it Glitch.
Because everyone was distributed across North America, Tiny Speck began to rely heavily on internet relay chat (IRC) to communicate. Before long, the team realized that it needed something a bit more powerful: a tool that enabled it to keep in touch asynchronously, search through message history, and send files. The members set out to build it.
The game ultimately shut down in 2012, and the company laid off most of its employees, but Tiny Speck had one final trick up its sleeve. In an unlikely pivot, the few remaining employees chose to commercialize their internal communications tool. They polished the experience and branded it Slack: searchable log of all conversation and knowledge.
The Tiny Speck crew contacted friends and past colleagues to test out its new tool. With each new batch of users, the team collected feedback, fixed bugs, and built new functionality. By May 2013, the product was ready for a preview release, available to a select few who requested invitations. Just nine months later, Slack launched publicly.
Usage skyrocketed. Within a year, the tool went from having just under 15,000 daily active users to 500,000. By the time the product hit its two-year anniversary, more than 2.3 million users were using Slack every day. In late 2019, nearly six years from launch, that number exceeded 12 million, with more than 1 billion messages sent every week.
Many of Slack’s early technology and design decisions were informed by the founders’ experience building Flickr and Glitch. The usage of PHP and MySQL, for instance, was a logical one, given their experience building the photo-sharing website in 2004. In fact, much of Slack’s basic server functionality has its roots in Flamework, a PHP web-application framework born out of the processes and house style developed at Flickr; you can find it on GitHub. Much of the real-time messaging infrastructure was derived directly from Tiny Speck’s IRC-like internal tool.
In early 2016, Slack began to look at some alternatives to the Zend Engine II interpreter for PHP. There were two main contenders: upgrade to PHP 7 and use Zend Engine III, or try Facebook’s HipHop Virtual Machine (HHVM). After some deliberation, leadership decided to roll out the HHVM runtime to its web servers. Once the rollout proved successful, the engineering team began to adopt the Hack programming language, a gradually typed dialect of PHP developed to run atop HHVM. At the time of publication, the portion of Slack’s codebase that was once written in PHP is now written in Hack.
Both of the case studies in this section will focus on large refactoring efforts carried out on the portion of the codebase written in PHP and, later, Hack. To convey the nature of each problem as well as possible, the code samples in these sections will be in Hack. But don’t worry! While the snippets help provide small, concrete examples of the problem we were tackling, they are not the focus of the story. Refactoring at scale is primarily about the process and the people involved rather than the code itself, and I hope that these case studies help illustrate exactly that. If you’re still concerned about being able to parse the code samples, let me reassure you that at the time, Hack code still looked quite a bit like PHP. For those who aren’t comfortable with either Hack or PHP, we’ll walk through each snippet in detail so that you can get your bearings.
I’d like to draw attention to one final observation before we move on. At the time of publication, Slack has only been publicly available for six years. The code, the product, and the company are all relatively young. The code has had to scale rapidly to handle increasing customer usage as well as a growing number of engineers developing the product. Many of the large refactoring efforts that have begun throughout the company over the years have been in response to hypergrowth, both external due to high adoption and internal due to hiring.
For the first of our two case study chapters, we explore a refactor I carried out with a few other members of my team during my first year at Slack. The project centered on consolidating two redundant database schemas. Both schemas were tightly coupled to our increasingly unwieldy codebase, and we had very few unit tests to rely on. In short, this project is a great example of a realistic, large-scale refactor at a relatively young, high-growth company with a modest number of engineers and an increasingly unwieldy codebase.
This project was successful primarily because we remained hyperfocused on our ultimate goal of consolidating the redundant database tables. We drafted a simple but effective execution plan (Chapter 4), thoughtfully weighing risk and speed of execution to deliver on our solution promptly. We opted for a lightweight approach to gathering metrics (Chapter 3), choosing a narrow focus on just a few key data points. We proactively communicated our changes widely, across the entirety of the engineering team, whenever we completed a new milestone (Chapter 7). We built tooling to ensure that our changes would persist (Chapter 9). Finally, we successfully demonstrated the value of the refactor by seamlessly shipping a new feature built atop the newly consolidated schema just weeks after its completion. This enabled us to get further buy-in to kick off further refactors (Chapter 5).
Although the refactor yielded the performance improvements we sought, we took a few missteps along the way. Due to significant pressure from our most important customers, we rushed to start making headway; we did not investigate why the schemas had converged, nor commit our plan to writing for other teams to consume easily (Chapter 4). We didn’t seek broader, cross-functional support (Chapter 5), leaving the bulk of the work to our small team. Even then, we struggled to keep up the momentum, and the refactor dragged in its final few weeks (Chapter 8).
Before we dive into the refactor itself, however, it’s imperative to understand what Slack does and the basics of how it works. If you aren’t familiar with the product, I strongly recommend giving this section a thorough read. If you’re a regular Slack user, feel free to skip ahead to “Slack Architecture 101”.
Slack is first and foremost a collaboration tool for companies of all sizes and industries. Typically, a business will set up a Slack workspace and create user accounts for each employee. As an employee, you can download the application (on your desktop machine, your mobile phone, or both) and immediately begin communicating with your teammates.
Slack organizes topics and conversations into channels. Let’s say that you’re working on a new feature that enables your users to upload files into your application faster. We’ll call the project “Faster Uploads.” You can create a new channel named #feature-faster-uploads, where you can coordinate development with fellow engineers, your manager, and your product manager. Anyone at the company curious to know how development is going on “Faster Uploads” can navigate to #feature-faster-uploads and read through the recent history or join the conversation and ask a question to the team directly.
You can see a simple example of what the Slack interface looked like during the first half of 2017, around the time of this first case study, in Figure 10-1.
Here, our example user is Matt Kump, an employee of Acme Sites. You can see the name of the workspace we’re currently viewing at the top left, and Matt’s name immediately below it.
The leftmost sidebar contains all of Matt’s channels. We’ll ignore the starred section for now and focus on the Channels section first. We can see from this list that Matt is involved in conversations about accounting costs (#accounting-costs), brainstorming (#brainstorming), business operations (#business-ops), and a handful of others. Each of these channels is public, meaning that anyone with an account at Acme Sites can discover the channel, view its contents, and join it.
You might have noticed that the #design-chat channel has a little lock where the others have the # symbol. This indicates that the channel is private. Only users who are members of the private channel can discover it and view its contents. To join a private channel, you must be invited by someone who is already a member.
Farther down the sidebar is Matt’s list of Direct Messages. We can see that he’s in a number of direct, one-on-one conversations with fellow teammates like Brandon, Corey, and Fayaz. He is also in a group conversation with both Lane and Pavel; these work just like direct messages, but with a handful of teammates rather than just one.
Understanding the distinction between public and private channels becomes important when we start discussing some of the key problems this case study refactor sought to solve.
You may have noticed that some of the channels in the sidebar appear bolded in bright white. This indicates that they contain new messages you haven’t read yet. If Matt were to select #brainstorming, he would find some new content to read, and the channel in the sidebar would fade to match the others.
While there’s much, much more to Slack, this covers the basics you’ll need to understand before we dive into the historical context leading up to this case study.
Now let’s explore a few basic components of Slack’s architecture that are at the core of our study. It’s important to note that some of these components have changed significantly beyond the refactoring effort outlined in this chapter, so the details provided here do not accurately reflect how Slack is architected today.
Let’s take a look at a simple request to fetch message history for a given channel. I’ll boot up my Slack instance and pop open one of my favorite channels, #core-infra-sourdough (shown in Figure 10-2), where a handful of infrastructure engineers discuss sourdough baking.
If I monitored network traffic, I would have seen a GET request to the Slack API for channels.history with the channel ID for #core-infra-sourdough. The request would first hit a load balancer to reach an available web server. Next, the server would verify a few things about the request. This includes confirming that the provided token is valid and that I have access to the channel I want to read. If I had access, the server would fetch the most recent messages from the appropriate database, format them, and return them to my client. Voila! In just a few milliseconds, I could fetch the most recent content for the channel I selected.
How did the server know which database to reach out to in order to locate the correct messages? Within the product, everything belonged to a single workspace. All messages were contained within channels, and all channels were contained within a workspace. Having everything map to a single, logical unit gave us a convenient way of horizontally distributing our data.
Every workspace was assigned to a single database shard, where all of its relevant information was stored. If a user was a member of a workspace and wanted to get a list of all the public channels available, our servers would make an initial query to find out which shard contained the workspace’s data and then query that shard for the channels.
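The two-step lookup described above can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: in-memory arrays stand in for the real databases, and the names `$workspace_to_shard`, `$shards`, `public_channels`, and `list_public_channels` are hypothetical rather than taken from Slack's codebase.

```php
<?php
declare(strict_types=1);

// Hypothetical routing table: workspace ID => shard ID.
$workspace_to_shard = ['T_ACME' => 1, 'T_VLB' => 2];

// Hypothetical shards, each holding the public channel rows of its workspaces.
$shards = [
    1 => ['public_channels' => [
        ['team_id' => 'T_ACME', 'name' => 'brainstorming'],
        ['team_id' => 'T_ACME', 'name' => 'business-ops'],
    ]],
    2 => ['public_channels' => [
        ['team_id' => 'T_VLB', 'name' => 'general'],
    ]],
];

/**
 * Step 1: find out which shard holds the workspace's data.
 * Step 2: query only that shard for the workspace's public channels.
 */
function list_public_channels(string $team_id, array $workspace_to_shard, array $shards): array {
    if (!isset($workspace_to_shard[$team_id])) {
        throw new InvalidArgumentException("Unknown workspace: $team_id");
    }
    $shard = $shards[$workspace_to_shard[$team_id]];

    $names = [];
    foreach ($shard['public_channels'] as $row) {
        // A shard can host several workspaces, so filter by workspace ID.
        if ($row['team_id'] === $team_id) {
            $names[] = $row['name'];
        }
    }
    return $names;
}

$acme_channels = list_public_channels('T_ACME', $workspace_to_shard, $shards);
```

Because every row carries its workspace ID, a shard can be shared by several small workspaces or dedicated to one large one without changing the lookup.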
If a large customer grew and began to occupy more space within a shard that it shared with other companies, we redistributed these other companies to different shards, giving the growing customer more wiggle room. If a customer was the sole occupant of their shard and they continued to grow, we upgraded the shard’s hardware to accommodate the growth. All in all, our database structure looked as pictured in Figure 10-3.
Next, we’ll take a peek at how we stored a few key pieces of information in each workspace shard. Specifically, we’ll look at channels and channel membership. At the start of 2017, Slack had a few tables responsible for storing information about channels. We had a table that stored information for public channels, called teams_channels. We had another table, groups, which stored information for private channels and group direct messages (messages among more than one user). Each of these tables contained basic information about the channel, things like the name of the channel, when it was created, and who created it. Figure 10-4 illustrates a few sample rows of two tables we used to store channel information.
Figure 10-4. Simplified table schemas for teams_channels and groups
We stored information about members of those channels on teams_channels_members and groups_members, respectively. For each member, we would store a row uniquely identified by the combination of workspace ID, channel ID, and user ID. We additionally stored some key pieces of information regarding that user’s membership such as the date that they joined the channel and the time, as a Unix epoch timestamp, at which they last read content in that channel. Figure 10-5 demonstrates that these two tables were nearly identical.
Figure 10-5. Simplified table schemas for teams_channels_members and groups_members
Finally, for direct messages, we had a single table called teams_ims (shown in Figure 10-6) to store information about both the channel itself and its membership.
In total, we had three distinct tables to store information about channels, and three distinct tables to store information about channel membership. Figure 10-7 illustrates the role of each table as it relates to the kind of channel it dealt with.
Now that we have a better understanding of Slack’s basic architecture and, more specifically, how channels and channel membership were represented, we can dive into the problems that arose as a result. We’ll describe three of the most serious problems we encountered, as they were experienced by our largest customer at the time, which we’ll refer to as Very Large Business, or VLB for short, for the remainder of the chapter.
VLB was eager for all of its 350,000 employees to use Slack. It had begun using the product slowly at first but began ramping up its usage aggressively during the first few months of 2017. By April, it had just over 50,000 users on the platform, nearly double that of our second-largest customer. VLB started hitting the limitations of nearly every piece of our product. At the time, I was part of the team responsible for Slack’s performance with our biggest customers. For several weeks, our team shared a rotation whereby two of us needed to be at our desks in our San Francisco headquarters at 6:30 a.m. to be ready to respond to any immediate issues during VLB’s peak log-in time on the East Coast. As our team scrambled quickly to patch problems left and right, we began to notice that each of them was exacerbated by the fact that we had redundant database tables for storing channel membership.
Every weekday morning, starting at 9 a.m. eastern time, VLB employees would start logging on to Slack. As more people began their workday, more load began to pile up on VLB’s database shard. Our existing instrumentation showed us that the culprit was most likely one of the most crucial APIs we called on startup, rtm.start.
This API returned all the necessary information to populate a user’s sidebar; it fetched all the public and private channels the user was a member of, fetched all the group and direct messages they had open, and determined whether any of those channels contained messages that they hadn’t yet read. The client would then parse the result and populate the interface with a tidy list of bolded and unbolded conversations.
From the server perspective, this was an incredibly expensive process. To determine a user’s memberships, we needed to query three tables: teams_channels_members, groups_members, and teams_ims. From each set of memberships, we extracted the channel_id and fetched the corresponding teams_channels or groups row to display the channel name. We also queried the messages table to fetch the timestamp of its most recent message, which we compared to the user’s last_read timestamp to determine whether they had any unread messages. We executed the vast majority of these queries individually, incurring network roundtrip costs each time.
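The unread check at the heart of this process is simple once the data is in hand; the cost came from gathering it. As a rough sketch (the function name `channels_with_unreads` and the sample data are assumptions, and plain arrays replace the individual database queries), the comparison looks like this:

```php
<?php
declare(strict_types=1);

/**
 * Decide which conversations should appear bolded in the sidebar.
 *
 * $memberships: rows shaped like the membership tables (channel_id, last_read).
 * $latest_message_ts: channel_id => Unix timestamp of the channel's newest message.
 */
function channels_with_unreads(array $memberships, array $latest_message_ts): array {
    $unread = [];
    foreach ($memberships as $membership) {
        $latest = $latest_message_ts[$membership['channel_id']] ?? 0;
        // Unread means something was posted after the user last read the channel.
        if ($latest > $membership['last_read']) {
            $unread[] = $membership['channel_id'];
        }
    }
    return $unread;
}

// Memberships gathered from the three tables (sample data).
$memberships = array_merge(
    [['channel_id' => 'C1', 'last_read' => 1500000000]], // teams_channels_members
    [['channel_id' => 'G1', 'last_read' => 1500000500]], // groups_members
    [['channel_id' => 'D1', 'last_read' => 1500000900]]  // teams_ims
);
$latest_message_ts = ['C1' => 1500000100, 'G1' => 1500000500, 'D1' => 1500000000];

$unread = channels_with_unreads($memberships, $latest_message_ts);
```

In this sample only C1 has a message newer than the user's last_read value. The `array_merge` call stands in for the three separate membership queries, each of which was a network round trip in the real system.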
Sporadically throughout the day, we noticed spikes in expensive queries to the database. Our dashboards surfaced a few potential candidate callsites, including the function responsible for calculating file visibility at the core of most of our files-related APIs. Popping open the target function, we yet again came face to face with a set of complex queries.
When a user uploads a file to Slack, the servers write a new row to the files table denoting the file’s name, its location on our remote file server, and a handful of other relevant pieces of information. Whenever a file is shared to a channel, we write a new entry to the files_share table, denoting the file ID and the ID of the channel to which it was shared. When a file is shared to a public channel, it becomes visible to any user on the workspace and is denoted as publicly discoverable by setting the is_public column to true on its files row. Thus, in the simplest case, the file is public, we can determine that quickly, and we can reveal it to the user.
When a file isn’t public, however, the logic becomes a little bit more complicated. We have to cross-reference all channels that the user is a member of with all the channels where the file was shared. As is the case for rtm.start, to determine a user’s complete set of channel memberships, we had to query three distinct tables. We then combined those results with those from the files_shares table for the target file. If we found a match, we could show the file to the user; if not, we returned an error to the client.
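The visibility rule itself fits in a few lines. The sketch below is an assumption-laden illustration, not Slack's implementation: `can_user_see_file` is a hypothetical name, and the two ID lists are passed in as arrays where the real code would have queried the three membership tables and the file's share rows.

```php
<?php
declare(strict_types=1);

/**
 * $file: a row from the files table (only 'is_public' is used here).
 * $user_channel_ids: every channel the user belongs to, across all three membership tables.
 * $shared_channel_ids: every channel the file was shared to.
 */
function can_user_see_file(array $file, array $user_channel_ids, array $shared_channel_ids): bool {
    // Simplest case: a public file is visible to everyone on the workspace.
    if ($file['is_public']) {
        return true;
    }
    // Otherwise the user must belong to at least one channel the file was shared to.
    return count(array_intersect($user_channel_ids, $shared_channel_ids)) > 0;
}

$private_file = ['is_public' => false];
$member_can_see = can_user_see_file($private_file, ['C1', 'G2'], ['G2']);
$outsider_can_see = can_user_see_file($private_file, ['C1'], ['G2']);
```

The expensive part is not the intersection but assembling `$user_channel_ids`, which required one query per membership table.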
The query that caused the most consistent amount of load on VLB’s shard for the full duration of the workday was the query responsible for determining whether a user (or the topics they subscribe to) were mentioned in a channel and hadn’t yet read those messages. A mention can be any number of things within Slack. It can be a username or a username prefixed with the @ symbol. It can be a highlight word for which the user has enabled notifications within their user preferences. The client would then use that data to populate badges with the number of unread mentions to the right of the corresponding channel name in the sidebar. You can see one of the many complex mentions-related queries in its 40-line glory in Example 10-1.
This query, yet again, required fetching a user’s memberships across the three membership tables. The tricky part was when we needed to exclude any memberships for which the associated channels were deleted or archived, requiring us to join the membership results with their corresponding channel row on either groups or teams_channels.
-- % denotes substitution syntax
SELECT
    tcm.channel_id as channel_id,
    'C' as type,
    tcm.last_read
FROM
    teams_channels tc
    INNER JOIN teams_channels_members tcm ON (
        tc.team_id = tcm.team_id
        AND tc.id = tcm.channel_id
    )
WHERE
    tc.team_id = %TEAM_ID
    AND tc.date_delete = 0
    AND tc.date_archived = 0
    AND tcm.user_id = %USER_ID
UNION ALL
SELECT
    gm.group_id as channel_id,
    'G' as type,
    gm.last_read
FROM
    groups g
    INNER JOIN groups_members gm ON (
        g.team_id = gm.team_id
        AND g.id = gm.group_id
    )
WHERE
    g.team_id = %TEAM_ID
    AND g.date_delete = 0
    AND g.date_archived = 0
    AND gm.user_id = %USER_ID
UNION ALL
SELECT
    channel_id as channel_id,
    'D' as type,
    last_read
FROM
    teams_ims
WHERE
    team_id = %TEAM_ID
    AND user_id = %USER_ID
Now that we have sufficient background on the problem we aimed to solve, we can begin to discuss the refactor. I wish I could say that consolidating teams_channels_members and groups_members into a single table was a well-planned and smartly executed project, but that would not be true. In fact, the more chaotic portions of the refactor are what inspired and informed a great deal of the ideas in this book. We kicked things off with a sense of urgency, didn’t keep great tabs on progress as we went along, and in the end, although we knew we had decreased the load across most of our database tier, we could only point to a single metric to demonstrate roughly by how much. What ultimately made the project a success was the smart, dedicated set of individuals who helped us cross the finish line. Although our largest customers stood to benefit the most from the refactor, all of our customers ultimately benefited from the project.
We started the project somewhat immediately and without a written plan. Our top priority was to get the consolidation of the tables just to the point where we could migrate the one query that was hammering our database shards the most: the mentions query.
Although we knew that a great many queries would equally benefit from the consolidated table, their migration was strictly secondary. In Chapter 1, I strongly suggested that you not embark on a large-scale refactor unless you are confident that you can finish it. In this case, we certainly intended to finish the table consolidation; we just didn’t know whether other, more pressing performance issues might creep up and need to be prioritized over the refactor. We were willing to take the risk, given the urgency of the problem at hand, fully aware of the consequences if we failed to finish the migration.
First, we created a new table, channels_members. We combined the schemas of the membership tables, complete with the same indices, and introduced a new column to denote whether a row originated from teams_channels_members or groups_members, both to ease the migration and to ensure that we could respect any business-logic dependencies around the initial tables. Figure 10-8 shows our goal state as compared to Figure 10-7, our starting state.
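To make the shape of the consolidated rows concrete, here is a minimal sketch of how rows from the two legacy tables might map onto channels_members. The function name `to_channels_members`, the `origin` column name, and its values are assumptions made for illustration; only columns that appear in the chapter's queries are copied.

```php
<?php
declare(strict_types=1);

/**
 * Map rows from both legacy membership tables onto the channels_members shape.
 * The 'origin' column name and its values are assumptions for this sketch.
 */
function to_channels_members(array $teams_channels_members, array $groups_members): array {
    $merged = [];
    foreach ($teams_channels_members as $row) {
        $merged[] = [
            'team_id' => $row['team_id'],
            'channel_id' => $row['channel_id'],
            'user_id' => $row['user_id'],
            'last_read' => $row['last_read'],
            'origin' => 'teams_channels_members',
        ];
    }
    foreach ($groups_members as $row) {
        $merged[] = [
            'team_id' => $row['team_id'],
            // groups_members calls this column group_id; the merged table uses channel_id.
            'channel_id' => $row['group_id'],
            'user_id' => $row['user_id'],
            'last_read' => $row['last_read'],
            'origin' => 'groups_members',
        ];
    }
    return $merged;
}

$merged = to_channels_members(
    [['team_id' => 'T1', 'channel_id' => 'C1', 'user_id' => 'U1', 'last_read' => 10]],
    [['team_id' => 'T1', 'group_id' => 'G1', 'user_id' => 'U1', 'last_read' => 20]]
);
```

Keeping the origin on each row means code that still depends on the public/private distinction can filter on it while the migration is in flight.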
Rewriting our queries to target a single new table would not be easy. Slack’s codebase was written in a very imperative style, with everything from short functions to long functions, distributed across hundreds of loosely namespaced files. Its original authors had stuck to what they knew well and steered clear of object-oriented patterns due to performance concerns with PHP. They preferred writing individual queries inline rather than relying on an object-relational mapping library and risking bloating the codebase early.
One-off queries to either teams_channels_members or groups_members were strewn across 126 files. Many of the queries hadn’t been touched since well before the product launched. To top it off, we knew much of the code that contained these queries didn’t have great unit test coverage. To give you a sense of what these might have looked like, I dug up some old code, which you can see in Example 10-2.
function chat_channels_members_get_display_counts($team, $user, $channel) {
    // Some business logic
    $sql = "SELECT
            COUNT(*) as display_counts,
            SUM(CASE
                WHEN (is_restricted != 0 OR is_ultra_restricted != 0)
                THEN 1
                ELSE 0
            END) as guest_counts
        FROM
            teams_channels_members AS tcm
            INNER JOIN users AS u ON u.id = tcm.user_id
        WHERE
            tcm.team_id = %team_id
            AND tcm.channel_id = %channel_id
            AND u.deleted = 0";

    $ret = db_fetch_team($team, $sql, array(
        'team_id' => $team['id'],
        'channel_id' => $channel['id']
    ));

    // A bit more business logic
    return $counts;
}
Business logic code surrounding these queries would index directly into the resulting columns, cementing a tight coupling between our database schemas and the code. Whenever we introduced new columns, we had to update the corresponding code to take them into consideration. Say we had a column on the files table called is_public to denote whether the file was public. If we later introduced additional logic that required us to check an additional property to determine whether the file was public, any code that relied on a simple check of if ($file['is_public']) would need to be updated to accommodate that change properly.
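One way to picture the alternative is a single helper that owns the rule, so that a later change touches one function rather than every call site. The sketch below is purely illustrative: file_is_public and the is_hidden_by_admin column are hypothetical names, not something taken from Slack’s codebase.

```php
<?php
// Hypothetical helper: call sites ask this function instead of reading
// $file['is_public'] directly, so the rule lives in exactly one place.
function file_is_public(array $file): bool {
    // Original rule: a single column decides.
    $is_public = !empty($file['is_public']);

    // Later rule (assumed for illustration): a second property can override it.
    $is_hidden = !empty($file['is_hidden_by_admin']);

    return $is_public && !$is_hidden;
}

$file = array('id' => 'F123', 'is_public' => 1, 'is_hidden_by_admin' => 0);
if (file_is_public($file)) {
    echo "File {$file['id']} is public\n";
}
```

With a helper like this, introducing the second property means editing one function body; code that indexes into the row directly has to be found and updated everywhere.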
To consolidate teams_channels_members and groups_members into channels_members, we needed to identify all the queries to either table scattered across the codebase. With a quick grep of the codebase, we were able to extract a list of all the locations where we queried groups_members or teams_channels_members. We plugged the list of files and line numbers directly into a shared Google Sheets file, shown in Figure 10-9.
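The search itself was a plain grep, but the same inventory can be sketched in PHP for readers who want file, line number, and table name in a form that pastes straight into a spreadsheet. This is an approximation under stated assumptions: find_table_references is a hypothetical name, only .php files are scanned, and a simple text match stands in for any real understanding of the SQL.

```php
<?php
// Scan every .php file under $root and report each line that mentions one of
// the given table names. Returns a sorted list of [path, line number, table].
function find_table_references(string $root, array $tables): array {
    $quoted = array_map(function ($t) { return preg_quote($t, '/'); }, $tables);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/';

    $hits = array();
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile() || $file->getExtension() !== 'php') {
            continue;
        }
        $lines = file($file->getPathname(), FILE_IGNORE_NEW_LINES);
        foreach ($lines as $i => $line) {
            if (preg_match($pattern, $line, $m)) {
                $hits[] = array($file->getPathname(), $i + 1, $m[1]);
            }
        }
    }
    sort($hits);
    return $hits;
}

// Only scan when a directory is passed on the command line.
if (isset($argv[1])) {
    $tables = array('teams_channels_members', 'groups_members');
    foreach (find_table_references($argv[1], $tables) as $hit) {
        echo implode(',', $hit), "\n";
    }
}
```

Because underscores count as word characters in the pattern, the word boundaries keep a search for groups_members from matching a longer identifier such as old_groups_members.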
We decided to create a single file where we could house all the queries related to channel membership. Our effort to revive our struggling membership queries conveniently arose around the same time engineers had begun having conversations about centralizing our queries. We were a growing team, trying to execute quickly, and needing to remember to update queries in haphazard corners of the codebase every time we altered a table was getting tedious. A few proposals had been shopped around, with engineers in favor of storing all queries to a given table in a single file. While some wanted an approach that would allow them to generate queries, given a set of parameters, leading us to build a more complex data access layer, others wanted to continue to be able to read the queries inline. We decided that with this project, we’d prototype minimal query generation as a means of limiting the number of individual functions in our new file. We decided to call this new pattern unidata, or ud for short, thus naming our target file ud_channel_membership.php.
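To make “minimal query generation” concrete, here is one possible shape for it: a single function that assembles a membership query from whichever filters are supplied, instead of one hand-written function per combination. Everything in the sketch is an assumption for illustration, including the function name ud_channel_membership_build_select and the choice of filterable columns; it reuses the %name placeholder style that appears in the chapter’s examples and is not the unidata code itself.

```php
<?php
// Hypothetical sketch: build one membership SELECT from the filters provided.
// Returns the SQL string and the bind array that would accompany it.
function ud_channel_membership_build_select(array $filters, array $columns = array('*')): array {
    // Only these columns may be filtered on; anything else is ignored.
    $allowed = array('team_id', 'channel_id', 'user_id');

    $where = array();
    $bind = array();
    foreach ($allowed as $column) {
        if (array_key_exists($column, $filters)) {
            $where[] = "{$column}=%{$column}";
            $bind[$column] = $filters[$column];
        }
    }
    if (!$where) {
        throw new InvalidArgumentException('At least one filter is required');
    }

    // $columns is assumed to be a list of trusted, hard-coded column names.
    $sql = 'SELECT ' . implode(', ', $columns)
        . ' FROM channels_members WHERE ' . implode(' AND ', $where);

    return array($sql, $bind);
}

list($sql, $bind) = ud_channel_membership_build_select(
    array('team_id' => 'T1', 'channel_id' => 'C1'),
    array('user_id')
);
echo $sql, "\n";
```

The trade-off debated in the text is visible here: the generated SQL is no longer readable inline at the call site, but the file needs far fewer near-identical functions.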
Now that we had a table and a set of queries to migrate, we could get started. We needed to identify each query from our initial grep that inserted rows, updated values, or deleted rows. For each query, we created a corresponding function in our unidata library containing a copy. Each function would take a parameter to indicate whether to execute the query on teams_channels_members or groups_members, alongside some logic to execute the same query conditionally against our new table, channels_members. The general idea is shown in Example 10-3.
    function ud_channel_membership_delete($team, $channel_id, $user_id, $channel_type) {
        if ($channel_type == 'groups') {
            $sql = 'DELETE FROM groups_members WHERE team_id=%team_id AND
                group_id=%channel_id AND user_id=%user_id';
        } else {
            $sql = 'DELETE FROM teams_channels_members WHERE team_id=%team_id AND
                channel_id=%channel_id AND user_id=%user_id';
        }

        $bind = array(
            'team_id' => $team['id'],
            'channel_id' => $channel_id,
            'user_id' => $user_id,
        );

        $ret = db_write_team($team, $sql, $bind);

        if (feature_enabled('channel_members_table')) {
            $sql = 'DELETE FROM channels_members WHERE team_id=%team_id AND
                channel_id=%channel_id AND user_id=%user_id';
            $double_write_ret = db_write_team($team, $sql, $bind);

            if (not_ok($double_write_ret)) {
                log_error("UD_DOUBLE_WRITE_ERR: Failed to delete row for channels_members for {$team['id']}-{$channel_id}-{$user_id}");
            }
        }

        return $ret;
    }
Once we had successfully moved over all write operations, we wrote a backfill script to copy all existing data from both membership tables onto our new table. Note that we migrated write operations before starting a backfill to ensure that the data in the new table would be accurate. We then backfilled all membership data for our own workspace, followed promptly by VLB during off-hours to prevent any unnecessary load during their workday. We triple-checked that no errant writes to either table remained outside of our new library, but given that the engineering organization was moving quickly, there was a nonzero chance we had missed one or two queries. We had not yet put any mechanisms in place to prevent an engineer on a different team from adding a new query without alerting us, so to ensure that the backfilled data remained consistent with the live data, we warned our engineering team about our process (see Figure 10-10) and wrote a script we could manually kick off to identify any inconsistencies and optionally patch them if desired.
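The consistency script is not reproduced in the chapter, but its core comparison can be sketched as a pure function over rows already fetched from each table. The names membership_key and find_membership_mismatches are hypothetical, and the sketch compares only the identifying columns; a real check would presumably also compare values such as last_read.

```php
<?php
// Build a "team-channel-user" key. Rows from groups_members carry group_id
// (see Example 10-3); rows from the other tables carry channel_id.
function membership_key(array $row): string {
    $channel_id = $row['channel_id'] ?? $row['group_id'];
    return "{$row['team_id']}-{$channel_id}-{$row['user_id']}";
}

// Report memberships present on one side but not the other.
function find_membership_mismatches(array $legacy_rows, array $new_rows): array {
    $legacy_keys = array_map('membership_key', $legacy_rows);
    $new_keys = array_map('membership_key', $new_rows);

    return array(
        'missing_from_new' => array_values(array_diff($legacy_keys, $new_keys)),
        'unexpected_in_new' => array_values(array_diff($new_keys, $legacy_keys)),
    );
}

$legacy_rows = array(
    array('team_id' => 'T1', 'channel_id' => 'C1', 'user_id' => 'U1'), // teams_channels_members
    array('team_id' => 'T1', 'group_id' => 'G1', 'user_id' => 'U2'),   // groups_members
);
$new_rows = array(
    array('team_id' => 'T1', 'channel_id' => 'C1', 'user_id' => 'U1'), // channels_members
);
print_r(find_membership_mismatches($legacy_rows, $new_rows));
```

Entries under missing_from_new point at writes that bypassed the double-writing library, and those are the rows an optional patch step would copy over.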
In some of the screenshots included in this chapter, you might see some references to TS. TS is short for Tiny Speck, the previous name of the company before Slack, the product, was launched publicly in 2014. If you see a reference to something being “enabled to TS,” this just means that we’re enabling the change to our own workspace.
After enabling double-writing for VLB, we watched its database health carefully; teams_channels_members and groups_members rows were updated very frequently. Whenever a user read a new message, the client issued a request to the servers to update the user’s last_read timestamp on their membership row. Now, with the addition of channels_members, we were issuing double the number of writes. We spent a day monitoring traffic to gain confidence that the workspace had enough bandwidth to handle the additional load.
Now that our tables were in sync and we were double-writing updates, we could execute on our most important milestone: migrating the mentions query. Whenever we were ready to give something a try in production, we first rolled it out to our own team. This was (and still is) the typical strategy for testing our work in production, whether it’s a new feature, a new piece of infrastructure, or, in our case, a performance enhancement. We typically would have rolled out to free workspaces next, slowly working our way up the payment tiers, leaving our largest, most performance-sensitive customers last; but with this particular endeavor, we wanted to ease the load on those top-tier customers first. So we flipped our strategy on its head.
We enabled optimized mentions to our team. Because we didn’t have much automated testing and our unit testing framework was unable to test the query properly, we relied on folks internally to spot any regressions before we enabled the query to any other customers. We carefully monitored channels where employees typically reported bugs. We later enabled this behavior for VLB.
We knew that our databases were overloaded. We measured their health by looking at what percentage of their CPU was idle. Typically, this would hover at about 25 percent but would regularly dip to 10 percent or below. This was troubling because the more time a database spent at less than 25 percent idle, the less likely it was to be able to handle a sudden increase in load. VLB was putting our product through its paces, and we never knew which part of the product would lead to an unexpected uptick in database usage next.
When we began the consolidation effort, we already had multiple other projects running in parallel to help address the load. Between the range of ongoing workstreams, the added load due to double-writing, recurrent fluctuations, and product engineering continuing to build out new features, we couldn’t rely on our database usage data to confirm that the refactor was effective. Besides, our monitoring data disappeared after about a week, so unless we had chosen a quiet day to capture some screenshots and record a series of data points, the data wouldn’t have been available to us upon completion to serve as a good baseline.
Instead, we chose to rely primarily on query timings data. We instrumented each query with timing metrics, allowing us to confirm whether the new query was in fact more performant. EXPLAIN plans can be quite insightful, but nothing beats having actual metrics to track the time spent executing a query from the server’s perspective. In an abundance of caution, instead of enabling the new treatment to all VLB users immediately, we randomly assigned incoming requests to either query. We first verified that the feature flag was enabled for the workspace and then randomly distributed the traffic 50-50. This enabled us to be a little bit more careful with our introduction of the change and confirmed that the new query was in fact more performant with a customer as large as VLB.
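The flag check, the 50-50 split, and the per-query timing can be sketched together in a few lines. This is a hypothetical illustration rather than the production code: run_mentions_query and average_seconds are made-up names, the boolean stands in for a feature-flag lookup, and the two callables stand in for the old and new mentions queries.

```php
<?php
// Hypothetical sketch: route a request to the old or new query and record
// how long the chosen query took, grouped by treatment.
function run_mentions_query(bool $flag_enabled, callable $old_query, callable $new_query, array &$timings) {
    // Only flagged workspaces are eligible, and roughly half of their
    // requests are routed to the new query.
    $use_new = $flag_enabled && mt_rand(0, 1) === 1;
    $label = $use_new ? 'new' : 'old';

    $start = microtime(true);
    $result = $use_new ? $new_query() : $old_query();
    $timings[$label][] = microtime(true) - $start;

    return $result;
}

// Average the recorded durations for one arm of the experiment.
function average_seconds(array $samples): float {
    return $samples ? array_sum($samples) / count($samples) : 0.0;
}

$timings = array('old' => array(), 'new' => array());
for ($i = 0; $i < 1000; $i++) {
    run_mentions_query(
        true,
        function () { return 'rows from the two-table query'; },
        function () { return 'rows from the channels_members query'; },
        $timings
    );
}
printf("old: %d requests, avg %.6fs\n", count($timings['old']), average_seconds($timings['old']));
printf("new: %d requests, avg %.6fs\n", count($timings['new']), average_seconds($timings['new']));
```

Assigning per request rather than per user means both queries see the same mix of load throughout the day, which is what makes the two averages comparable.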
We waited a few hours before taking a look at our data. We needed to make sure that the new query was consistently faster, meaning it needed to be faster both when the database was under average load and when it was at peak usage. Thankfully, the data looked promising across the board with a 20 percent speed-up! You can see the original data we pulled in Figure 10-11. The first query joined across both teams_channels_members and groups_members and on average completed in about 4.4 seconds. The second query read from channels_members alone and on average completed in about 3.5 seconds. We managed to shed nearly a second by using the consolidated membership table. (Both queries were too long to show in full, so only the first few lines are visible in the timings chart.)
With the confirmation that our refactor did the trick for our most important use case, we could justify moving forward with the remainder of the consolidation. We referred back to our Google Sheet tracker and began divvying up the remainder of the read queries to engineers on our team.
Unfortunately, it was difficult to get the help we needed to finish the migration. Given so many fires to put out, everyone on our team was parallelized across distinct remediation efforts. It was tough to get anyone else to take a few hours out of their day to carefully extract a handful of queries. To top it off, most of the code surrounding the remaining queries was untested, making what should have been a simple, straightforward change quite dangerous. Spending an afternoon migrating queries was simply not enticing.
I considered reaching out to other teams within Enterprise engineering for their help and tapping a handful of other performance-minded developers across the company but, ultimately, decided to keep trudging through on my own, with the occasional help from my immediate teammates. Because the work was risky and not particularly intellectually stimulating, I thought it might be too much of an uphill battle to convince a wider circle of engineers to contribute. In hindsight, I think I could have found a way to make the effort more compelling, distributed the work more evenly, and likely shaved off a few weeks.
When progress slowed to a crawl just a few weeks later, I attempted to bribe the team with cookies, which you can see in Figure 10-12. While there are a number of more traditional options for getting engineers motivated to help out (see Chapter 8), sometimes food is the best incentive of all.
Although our team was widely distributed across a number of projects, we still needed each other’s support. We relied on one another for code reviews, talking through tough bugs, and the occasional gut check. To make sure we could be effective in those roles while remaining highly focused on our own endeavor, we would regularly debug performance problems in public channels (oftentimes our own team channel) and hold in-person weekly meetings to discuss progress and blockers. For me, that meant a regular avenue to call out what percentage of queries were still littered across the codebase and talk through any bugs or inconsistencies I’d spotted in the data.
Whenever we reached a meaningful milestone, like enabling double-writes to our own workspace, or enabling the new mentions query to VLB, we’d announce the change in both our team channel and in a few engineering-wide channels for added visibility. The more engineers that were aware of the changes we were making, the better! It meant that an engineer on another team was less likely to introduce a new query against either table we were actively deprecating without referring to our new library. It also meant that as we triaged incoming customer bugs, any engineer could isolate and solve a related problem much more effectively.
Once no more entries were left in our tracker, we slowly began enabling all other teams beyond our own (and VLB) to read from the new table. We let the changes sit for two weeks before deciding it was safe to stop double-writing data to the old tables. We wanted to be certain that our database tier responded well to the new table, that its data was consistently correct, and that no new bugs related to the refactor were logged. Had double-writing not been expensive from both a load and monetary perspective, we might have allowed the changes to bake a bit longer, but we were eager to remove the overhead.
Finally, we stopped double-writing, first for our own team, then for VLB, and finally for the remainder of our customers. As with every important step of our refactor, we communicated it broadly, as shown in Figure 10-13. We then quickly tidied up our new library by removing all references to teams_channels_members and groups_members. We wrote some new linter rules, preventing engineers from writing new queries against either deprecated table and enforcing all new queries against the channels_members table to be properly located in our new centralized library. We wanted to prevent confusion among engineers about how far along we were with the refactor. Not everyone reads all announcements in cross-functional channels, especially if they are out on vacation or leave, so it’s important to make sure you don’t rely on those announcements alone for engineers across your organization to know what to do when they come across code that has been changed as part of your refactor.
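The linter rules themselves are not shown in the chapter. As a rough sketch of the two checks described, the function below flags any mention of a deprecated table, and any mention of channels_members outside the centralized library. It is a simple line-by-line text match under assumed names (lint_membership_queries is hypothetical), not a real parser or the actual rule.

```php
<?php
// Hypothetical lint check over one file's source text. Returns a list of
// human-readable violations, empty when the file is clean.
function lint_membership_queries(string $path, string $source): array {
    $violations = array();
    $in_library = basename($path) === 'ud_channel_membership.php';

    foreach (explode("\n", $source) as $i => $line) {
        $line_no = $i + 1;
        if (preg_match('/\b(teams_channels_members|groups_members)\b/', $line, $m)) {
            // Deprecated tables are off limits everywhere, library included.
            $violations[] = "{$path}:{$line_no}: query against deprecated table {$m[1]}";
        } elseif (!$in_library && preg_match('/\bchannels_members\b/', $line)) {
            // The new table may only be queried from the centralized library.
            $violations[] = "{$path}:{$line_no}: channels_members queries belong in ud_channel_membership.php";
        }
    }
    return $violations;
}

$source = "\$sql = 'SELECT * FROM channels_members WHERE team_id=%team_id';";
foreach (lint_membership_queries('lib_files.php', $source) as $violation) {
    echo $violation, "\n";
}
```

A check like this fails at review time with a message that says where the query should live, so an engineer who missed the announcements still learns what to do.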
Here’s a close-up of the graph in Figure 10-13’s Slack message:
Of course, we didn’t forget the most important final step: celebrating! As was tradition for much of the engineering team in San Francisco, we ordered a cake (Figure 10-14) adorned with the name of our new table to commemorate the completion of the project.
The project’s complete trajectory is shown in Figure 10-15, highlighting the number of queries executed against each table on a daily basis from May to September 2017.
Figure 10-15. Query volume for teams_channels_members, groups_members, and channels_members
There are a number of lessons to be learned from this case study, both from what went well and what could have gone better. We’ll start with where the project struggled, describing the pitfalls of not having a written execution plan, forgoing understanding of how the code had degraded, skimping on the number of tests we wrote, and failing to motivate teammates. Then we’ll discuss what went well, highlighting our sharp focus on dynamic milestones and a well-defined set of metrics.
Because the whole project began so fast, we didn’t have much of a written plan. Our team was familiar with the process involved in migrating data from one table to another. We knew the mentions query was our top priority and that we would complete only as much of the migration as was necessary to speed it up; we would reevaluate later. The only time the process appeared in written form was when we posted updates in our team channel (rather than in a channel dedicated to the project); even then, these were only pertinent subsets of the overall plan.
The fact that we never deliberately wrote down each of the steps involved from start to finish meant that we were more likely to forget something critical along the way. Perhaps most worrisome of all was the fact that we never shopped our plan around to other teams across the company to ensure that everyone had a chance to verify whether they might be affected by the change and voice their concerns if that was the case. We simply plowed through, on the assumption that performance was the most important thing we could be doing to improve our relationship with our largest customer (and, by extension, the most important thing we could be doing for the company). We also believed that we could implement the change in a way that would disrupt as few other engineering teams as possible.
This assumption proved to be wrong on multiple fronts. First, when a handful of inevitable bugs crept up and we hadn’t adequately socialized the change, engineers responding to those bugs were unpleasantly surprised. Second, we overlooked a team altogether that would bear an acute impact from the change. About a month before we finished migrating the final few membership queries, a teammate reminded me that we should probably warn the data engineering team about the changes we were making. By moving membership onto a new table, and nearing the stage at which we would disable writes to the old tables, we risked disrupting most of their pipelines, including pipelines responsible for calculating important usage metrics. We were fortunate that the data engineering team was quick to respond and update the necessary pipelines, and a serious crisis was averted.
These mishaps show just how important it is to develop and vet a thorough execution plan. We were lucky that we recovered from these oversights quickly, but why leave to chance what could have been addressed more deliberately during the early planning stages? As was highlighted in Chapters 4 and 7, having a concrete plan is crucial to uncovering gaps early and minimizing cross-functional miscommunication.
I highly recommend developers begin their code archeology expedition before they begin to execute their refactoring effort, because the added context can give a different shape and direction to the project. Unfortunately, due to the urgency of our work, we skipped the deliberate process of understanding and empathizing with the existing code and went right to execution. It was only well after we had begun migrating queries that I started to wonder why we’d made a distinction between teams_channels_members and groups_members in the first place.
As the weeks passed and there were still dozens of queries to migrate, I grew frustrated with the redundant tables and the way our SQL queries were strewn about. The more frustrated I became, the longer the project seemed to take (and the more tempting it became to cut corners in an attempt to reach the finish line faster).
After we had completed the refactor, I contacted a few of our early engineers to get some insight into why these tables had been distinct. I learned that keeping private and public channel information on separate tables isolated them from one another and served as a security precaution. Product history played a role as well; public channels and private channels felt like vastly different concepts in the early days of Slack. As the two concepts gradually converged, so did the table schemas.
Gaining this perspective proved helpful for subsequent refactors, informing how we went about consolidating teams_channels and groups into their own unified table. It gave me a newfound appreciation for decisions made early in Slack’s history, and a more positive attitude toward refactoring as an opportunity to improve something that had probably served us well for some time but no longer could, rather than as an opportunity to improve “bad” code. This experience is precisely why in Chapter 2 I recommend that engineers take the time to understand where the code they seek to improve came from, and how circumstances may have led it to degrade over time. If we have more empathy for the code, we stand to keep a more open mind and be more patient throughout the refactor.
In Chapter 1, I asserted that it’s important to have adequate test coverage before refactoring, to ensure that the application’s behavior is properly maintained at every step. In this project, the vast majority of the code we were modifying had been written early in Slack’s development and, due to the push to get the product to market quickly, much of it lacked adequate tests. The refactor to consolidate the channel membership tables was under significant time pressure as well; performance for our largest customer was a growing concern, so we did our best to make the necessary changes carefully, opting to write tests for only the most critical untested codepaths.
This decision led us to ship a handful of bugs throughout the refactor, each of which could have been prevented had we taken the time to write the requisite tests. We arguably spent more time recovering from the regressions we introduced than we would have writing the tests in the first place. Having adequate test coverage is essential for a smooth refactor, preventing your customers from experiencing bugs and your team from spending time solving them.
Rather than continuing to plow through alone, I should have found a better way to get other engineers involved more seriously at the outset and again when progress slowed a few weeks later. The last 10 percent of queries took about the same amount of time to migrate as the first 50 percent. Once we had successfully improved the mentions query for VLB, we began to lose the sense of urgency we had experienced at the start of the project. With every new bug or inconsistency in our data, we lost a little steam. By the time the project was nearly complete, everything about it felt like pushing a boulder up a mountain.
What we had not considered was soliciting help from engineers outside our own team. We could have been more strategic about asking for help from those on other product engineering teams, asking them to migrate the queries within their own features. We could have sold them on the effort by demonstrating the performance boost they stood to gain. Distributing the work could have allowed us to halve the amount of time it took to complete.
If momentum on your refactor starts to slow, seek ways to give it a boost early, before progress slows further. Slow refactors are more likely to lose priority, leaving behind a significant amount of code stuck between two states, which, as was pointed out in Chapter 1, poses its own set of problems. Chapter 8 covered a number of ways to keep your team motivated; do not hesitate to ask for more support if you need it!
We had preliminary data in the form of query EXPLAIN plans to support our hypothesis that combining the two membership tables would improve query performance. We needed further confirmation of that hypothesis during the early stages of the refactor so that we could pivot if the consolidation proved insufficient. By focusing on making only the changes necessary to enable the migration of the mentions query for VLB, we secured the confirmation we needed within just a few weeks and successfully alleviated load from the VLB database shard, buying us more time to see the remainder of the refactor through.
Proving your refactor’s effectiveness early ensures that your team does not waste any time continuing to execute a lengthy project that may not yield the desired results. By focusing on strategic milestones, those meant to benefit from the refactor can reap those benefits sooner; this can help your team, further bolstering support for the effort while it is still underway. For more details on how to identify strategic milestones, refer to Chapter 4.
We had a specific set of metrics that enabled us to show conclusively that our project was successful, both at our intermediate milestones and, once we’d completed the rollout, for all customers. By collecting EXPLAIN plans for queries before and after the consolidation, we were able to document progress as we migrated each of the more complex membership queries. By instrumenting the mentions query with timings metrics, we could monitor its performance in real time and immediately see the positive impact.
Keeping a close eye on your metrics helps you prove that your refactor is moving the needle in the right direction throughout its development. If at any point the metrics stop improving (or worse, start regressing), you can dig in immediately, addressing problems as soon as they arise, rather than at the project’s conclusion. Refer to Chapter 3 for suggestions on how to measure your refactor.
Here are the most important takeaways from our refactor to consolidate Slack’s channel membership tables.
Develop a thorough written plan and share it broadly.
Take the time to understand the code’s history; it might help you see it in a new, more positive light.
Ensure that there is adequate test coverage for the code you’re seeking to improve. If there isn’t, commit to writing the missing test cases.
Keep your team motivated. If you’re losing momentum, find creative ways to boost it back up.
Focus on strategic milestones to prove the impact of your refactor early and often.
Identify and rely on meaningful metrics to guide your efforts.
For the second of our two case study chapters, we’ll explore a refactor carried out by a group of engineers from the product engineering and infrastructure teams at Slack. The project built on the consolidation of our channel membership tables discussed in the previous chapter. If you haven’t read through the first case study yet, I recommend you do so; there’s important context you’ll want to understand to get the most out of this chapter.
Unlike the previous case study, which was primarily motivated by performance, this one was chiefly driven by Slack’s need to enable greater flexibility in the product. Having channel memberships tied to distinct workspace shards made it difficult for us to build more complex features stretching beyond single workspaces. We wanted to enable complex organizations with multiple workspaces to collaborate seamlessly within the same set of channels and facilitate communication between distinct Slack customers, allowing companies to coordinate with their vendors directly within the application. To unlock this ability, we needed to reshard channel membership data by user and channel rather than by workspace. This refactor illustrates the many challenges that come with large-scale database migrations, multi-quarter projects, and heavily cross-functional engineering efforts.
The refactor was successful because we had a strong understanding of the problem we needed to solve and how our evolving product strategy had led us to outgrow past architectural decisions (Chapter 2). We planned the project thoughtfully, choosing to juggle a few more variables than were strictly necessary, knowing it would render the refactor even more worthwhile (Chapter 4). We derived a careful rollout strategy, developing tooling that enabled us to carry it out as reliably as possible (Chapter 8). Finally, throughout the entire effort, we maintained a simple communication strategy.
Although the refactor ultimately gave us the ability to stretch our product in new and interesting ways, it took nearly double the time we had initially estimated to complete. We were too optimistic in our estimates (Chapter 4); it took over a year to finish what we had originally anticipated would take only six months. We underestimated the product implications of the refactor and only learned to leverage the expertise of product engineers after spending several months making little progress (Chapter 6).
As with the previous case study, we’ll start off with some important context, including a brief overview of why the way we distributed our data was becoming a bottleneck, and the motivations behind our adoption of a new database technology, Vitess. Once we’ve established a solid foundation and the motivations for our refactor, we’ll describe our solution and walk through each phase of the project.
To appreciate the problems we sought to solve with this refactor, we need to describe how our data was distributed across our databases in MySQL. Before we kicked off our refactor, the vast majority of our data was sharded by workspace, where a workspace is a single Slack customer. We touched on this in the previous case study under “Slack Architecture 101”; you can see an illustration of how different customers’ data was distributed across different shards in Figure 10-3.
While this worked just fine for a number of years, this sharding scheme grew increasingly inconvenient for two reasons.
First, we struggled to support our biggest workspace shards from an operational perspective. The shards housing our largest, fastest-growing customers suffered from frequent, problematic hotspots. These customers, already occupying isolated shards, were quickly approaching the data size at which we would no longer be able to upgrade their hardware space. With no simple mechanisms by which we could horizontally split their data, we were stuck.
Second, we were making important changes in our product that were actively leading us to break down the barriers between workspaces we had long upheld, both in the way our code was written and in how our data was structured. We had built features enabling our biggest customers to bridge together multiple workspaces and launched the ability for two distinct Slack customers to communicate directly within a channel they shared.
The mismatch between our product vision and the way our systems were architected meant that our application grew ever more complex. This was a perfect example of code degradation due to a shift in product requirements (as you might recall from Chapter 2!). To illustrate this problem more concretely, in the year leading up to this case study, we sometimes needed to query three distinct database shards to locate a channel and its memberships successfully. This was confusing for our developers, who needed to remember the correct set of steps to fetch and manipulate channel-related data.
To address our operational concerns with MySQL and our difficulty scaling, we started evaluating other storage options. After weighing multiple solutions, the team decided to adopt Vitess, a database clustering system built at YouTube that enables horizontal scaling of MySQL. With the migration to Vitess, we would finally be able to shard our data by something other than workspace, giving us the opportunity to free up space on our busiest shards and distribute our data in a way that made it easier for our engineers to reason about!
Given these circumstances, we decided to migrate the channel membership table, channels_members, to Vitess. Because this was one of our most high-traffic tables, resharding it would free up considerable space and load from our busiest workspace shards. The migration would also substantially simplify business logic around fetching memberships for channels that existed across workspace boundaries.
The project was spearheaded out of the Vitess infrastructure team, with help from a handful of product engineers who had intimate knowledge of our application query patterns against the channels_members table. We knew it would be a winning combination. The infrastructure engineers would contribute deep knowledge of the database system so that we could avoid any pitfalls during the migration and efficiently debug database-related issues as they arose; because they had the most expertise with table migrations to date, they’d be best suited to lead the project, with Maggie at the helm. The product engineers, including me, would provide crucial insight as to the new schema and sharding scheme and help with rewriting application logic to query the migrated data correctly.
We kicked things off in earnest by creating a new channel, #feat-vitess-channels, where we could easily bounce ideas off one another and coordinate workstreams. We invited everyone to join and jumped right into our first task.
Before we could begin migrating channel membership data to Vitess, we needed to decide how it would be distributed (i.e., which keys to use to reshard the table). Here, we had two options:
by channel (channel_id), to locate all memberships associated with a channel easily by querying a single shard
by user (user_id), to find all of a user’s memberships by querying a single shard
Having recently completed the consolidation of our membership tables per our first case study, my impression was that the majority of queries dealt with fetching membership for a given channel rather than for a given user. Many of these queries were crucial to the application, powering important features like Search, and the ability to mention everyone in a channel (via @channel or @here).
At the time (and still today), we logged a sample of all database queries to our data warehouse to keep tabs on our MySQL usage across requests to our production systems. To confirm my intuition that most of the traffic to channels_members relied on channel_id, I ran a few queries against this data, looking at sampled membership queries executed over a month-long period, and brought it to the team. The results are shown in Figure 11-1.
Figure 11-1. Number of queries run against channels_members that filtered on channel_id
One of the product engineers working with us, who had more experience with Vitess, pointed out that sharding by user might be a better bet. Pulling from the same set of query logs, he showed us the top 10 most frequent queries hitting the table filtered by user_id. The results are shown in Figure 11-2. If we wanted our application to perform well, we would need to account for this behavior.
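The kind of comparison described above can be sketched as a simple tally over sampled queries. This is a minimal illustration only: the sample statements, the `filter_columns` helper, and the matching rule are assumptions made for the example, not the actual warehouse schema or analysis.

```python
import re
from collections import Counter

# Hypothetical sample of logged statements; the real data warehouse
# format is not shown in the text.
SAMPLED_QUERIES = [
    "SELECT user_id FROM channels_members WHERE channel_id = 7",
    "SELECT channel_id FROM channels_members WHERE user_id = 3",
    "SELECT * FROM channels_members WHERE channel_id = 7 AND user_id = 3",
    "SELECT user_id FROM channels_members WHERE channel_id = 9",
]

CANDIDATE_KEYS = ("channel_id", "user_id")


def filter_columns(sql: str) -> frozenset:
    """Return the candidate sharding keys used as equality filters."""
    _, _, where = sql.partition(" WHERE ")
    return frozenset(
        col for col in CANDIDATE_KEYS
        if re.search(rf"\b{col}\b\s*=", where)
    )


def tally(queries) -> Counter:
    """Count how many sampled queries filter on each candidate key."""
    counts = Counter()
    for sql in queries:
        for col in filter_columns(sql):
            counts[col] += 1
    return counts


if __name__ == "__main__":
    print(tally(SAMPLED_QUERIES))
```

A raw count like this hides query frequency and cost, which is why looking at the most frequent queries, as the second engineer did, can lead to a different conclusion.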
Figure 11-2. Queries against channels_members and whether they filtered on user_id
We weighed both options, doing some back-of-the-napkin math to determine the database querying capacity required to support either option. We ultimately decided to compromise, denormalizing the membership into two tables, one sharded by user, the other sharded by channel, double-writing for both use cases. This way, point queries would be cheap for both.
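To make the compromise concrete, here is a minimal sketch of double-writing one membership into two denormalized copies, one placed by user and one placed by channel. The shard count, the modulo placement, and the `MembershipStore` class are assumptions for illustration; Vitess maps keys to shards through its own vindexes rather than a plain modulo.

```python
NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key: int) -> int:
    # Simplistic placement; a stand-in for the real key-to-shard mapping.
    return key % NUM_SHARDS


class MembershipStore:
    """Two denormalized copies of membership, each sharded by a different key."""

    def __init__(self):
        self.by_user = [dict() for _ in range(NUM_SHARDS)]
        self.by_channel = [dict() for _ in range(NUM_SHARDS)]

    def add(self, user_id: int, channel_id: int) -> None:
        # Double-write: every membership lands in both copies.
        self.by_user[shard_for(user_id)].setdefault(user_id, set()).add(channel_id)
        self.by_channel[shard_for(channel_id)].setdefault(channel_id, set()).add(user_id)

    def channels_for_user(self, user_id: int) -> set:
        # Point query: touches exactly one user shard.
        return self.by_user[shard_for(user_id)].get(user_id, set())

    def members_of_channel(self, channel_id: int) -> set:
        # Point query: touches exactly one channel shard.
        return self.by_channel[shard_for(channel_id)].get(channel_id, set())
```

The trade-off is visible in `add`: each write costs two operations so that each read can stay on a single shard.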
Next, we needed to take a hard look at our existing workspace-sharded table schema and determine whether we wanted to modify it for both our user- and channel-sharded use cases. Although we could have migrated our existing schema to both sharding schemes, this refactor gave us a unique opportunity to rethink some of the decisions we’d made with the original table design. We’ll take a closer look at the schema we derived for each, starting with the user shard. Example 11-1 shows the schema on the workspace shards, before the migration.
Example 11-1. CREATE TABLE statement for the existing channels_members table

CREATE TABLE `channels_members` (
  `user_id` bigint(20) unsigned NOT NULL,
  `channel_id` bigint(20) unsigned NOT NULL,
  `team_id` bigint(20) unsigned NOT NULL,
  `date_joined` int(10) unsigned NOT NULL,
  `date_deleted` int(10) unsigned NOT NULL,
  `last_read` bigint(20) unsigned NOT NULL,
  ...
  `channel_type` tinyint(3) unsigned NOT NULL,
  `channel_privacy_type` tinyint(4) unsigned NOT NULL,
  ...
  `user_team_id` bigint(20) unsigned NOT NULL,
  PRIMARY KEY (`user_id`,`channel_id`)
)
For the user-sharded case, we decided to maintain the majority of the original schema, with one exception: we made a significant change to how we stored user IDs. To understand the motivations behind this decision, we’ll give a brief overview of the two kinds of user IDs we stored and how they came about.
At the start of the chapter, we briefly mentioned that Slack sought to enable complex businesses, split into multiple workspaces according to department or business unit, to collaborate more easily. Without any centralization, not only did employees have difficulty communicating across departments, it was also difficult for the company to manage each individual workspace properly. To this end, we enabled our biggest customers to bring together their many workspaces under a single umbrella.
Unfortunately, in grouping workspaces, we needed a way to keep users in sync. Let’s illustrate how this works with a simple example.
Acme Corp. is a large corporation. It has a number of departments, each with its own workspace, including one for its engineering team and one for its customer experience department. As an employee of Acme Corp., you have a single, organization-level user account. If you happen to be an engineer, you are a member of the Engineering workspace to collaborate with your teammates, and the Customer Experience workspace to help the support team troubleshoot customer issues.
What appeared to be a single account at Acme Corp., however, was actually multiple accounts under the hood. At the organization level, a user had a canonical user ID. The same user had distinct local user IDs for each workspace they were a member of. This means that if you were a member of the Engineering and Customer Experience workspaces, you had three unique user IDs, or, to generalize, n + 1 IDs, where n was the number of workspaces of which you were a member.
As you might imagine, translating between these IDs quickly became exceedingly complicated and bug-prone. Within a year of launching this feature, a number of product engineers hatched a plan for replacing all local user IDs with canonical user IDs. Because most of the data stored in Slack’s systems refers to a user ID of some kind (authoring a message, uploading a file, etc.), a high degree of complexity was involved with correctly (and invisibly) rewriting these IDs.
The workspace-sharded channels_members table stored local user IDs in the user_id column. Because a project was already underway to replace all local user IDs with canonical user IDs, we decided to collaborate with them and ensure that we stored canonical user IDs across all user ID columns.
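As a small illustration of the n + 1 relationship and of the translation applied when writing to the new tables, consider the sketch below. The mapping, the workspace names, and the numeric IDs are all invented for the example; the actual lookup mechanism is not described in the text.

```python
# Hypothetical mapping: (workspace, local_user_id) -> canonical_user_id.
LOCAL_TO_CANONICAL = {
    ("engineering", 501): 42,
    ("customer_experience", 907): 42,
}


def canonical_user_id(workspace: str, user_id: int) -> int:
    """Translate a local user ID to its canonical counterpart.

    IDs with no mapping are assumed to be canonical already.
    """
    return LOCAL_TO_CANONICAL.get((workspace, user_id), user_id)


def all_ids_for(canonical_id: int) -> set:
    """Every ID that refers to the same person: n local IDs plus 1 canonical ID."""
    local_ids = {
        local_id
        for (_, local_id), mapped in LOCAL_TO_CANONICAL.items()
        if mapped == canonical_id
    }
    return local_ids | {canonical_id}
```

With two workspaces, `all_ids_for(42)` yields three IDs, matching the n + 1 count described above.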
Beyond our concerns with user IDs, we had some unease about the write bandwidth to the secondary, channel-sharded membership table. We examined the queries we planned to send to these shards to try to identify ways we could decrease write traffic. During that process, we noticed that most of the columns on the original table were entirely unused by their consumers, including the ones that were updated most often, like a user’s last read position in the channel. For example, if we queried for all the memberships associated with a given channel, the application logic would usually only use the user_id and user_team_id columns. By omitting these unnecessary columns in our new schema, we could dramatically decrease the write frequency, giving our channel shards a bit more breathing room. Example 11-2 shows the table schema for the channel-sharded membership table.
Example 11-2. CREATE TABLE statement for the second new channels_members table, sharded by channel

CREATE TABLE `channels_members_bychan` (
  `user_id` bigint(20) unsigned NOT NULL,
  `channel_id` bigint(20) unsigned NOT NULL,
  `user_team_id` bigint(20) unsigned NOT NULL,
  `channel_team_id` bigint(20) unsigned NOT NULL,
  `date_joined` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`channel_id`,`user_id`)
)
We next needed to update our application logic to accommodate the changes to our schemas and point to the Vitess cluster. Thankfully, most of these changes were straightforward and before we knew it, we’d updated the majority of our application logic accordingly.
Where the migration became more difficult was with complex queries involving JOINs with other tables in our MySQL cluster. Because we were moving the table to an entirely new cluster, we could no longer support these queries and had to split them up into smaller point queries, performing the JOIN directly in the application code.
We knew at the project’s outset that we would likely need to split up a handful of JOIN queries. What we did not anticipate was that most of them powered core Slack features and had been carefully hand-tuned for performance over a number of years. By splitting up these queries, we risked anything from slowing down notifications, to introducing data leaks, to bringing down Slack entirely. We were pretty nervous, but we needed to push on.
We put the day-to-day migrations on pause and compiled a list of the queries we were most concerned about, of which there were 20. Poring through the set, we worried that we didn’t have the product expertise required to adequately detangle each and every one. We estimated that without any additional help from product engineering, we’d need months to detangle each of the JOINs successfully. Fortunately, a number of product engineers responded to our call for help and together we developed a simple process that we could apply to split up each query safely.
To illustrate each step, we’ll walk through how we split up the query shown in Example 11-3, which was responsible for deciding whether a user had permission to see a specific file.
We first needed to identify the smallest subset of data we could fetch earliest; this would help us minimize the intersection of data we needed to work with as early as possible.
With the file visibility query, we knew from typical usage patterns that the number of places where a file was shared was usually much smaller than the number of channels that a user was in. (We could also verify this assumption by looking at a query’s cardinality.) So, instead of first querying for a user’s channel memberships and cross-referencing those with the channels where the file was shared, we fetched the locations where the file was shared first and then determined whether the user was in any of these channels. You can see an example of the query split up into its two components in Example 11-4.
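The ordering described above (fetch the smaller set first, then run a point query against the membership table) can be sketched as follows. This is not the original query: two in-memory SQLite databases stand in for the legacy MySQL cluster and the Vitess cluster, the `files_shares` table and its columns are hypothetical, and the real permission check almost certainly involves more conditions than channel membership.

```python
import sqlite3

# Two separate databases: a SQL JOIN across them is not possible,
# so the join has to happen in application code.
legacy = sqlite3.connect(":memory:")
vitess = sqlite3.connect(":memory:")

# Hypothetical table recording where each file has been shared.
legacy.execute("CREATE TABLE files_shares (file_id INTEGER, channel_id INTEGER)")
legacy.executemany(
    "INSERT INTO files_shares VALUES (?, ?)", [(1, 10), (1, 11), (2, 12)]
)

vitess.execute(
    "CREATE TABLE channels_members ("
    "user_id INTEGER, channel_id INTEGER, PRIMARY KEY (user_id, channel_id))"
)
vitess.executemany(
    "INSERT INTO channels_members VALUES (?, ?)", [(42, 11), (42, 13), (7, 12)]
)


def can_see_file(user_id: int, file_id: int) -> bool:
    # Step 1: fetch the (usually small) set of channels the file is shared in.
    rows = legacy.execute(
        "SELECT channel_id FROM files_shares WHERE file_id = ?", (file_id,)
    ).fetchall()
    channel_ids = [row[0] for row in rows]
    if not channel_ids:
        return False

    # Step 2: one point query, keyed by user_id, restricted to those channels.
    placeholders = ",".join("?" * len(channel_ids))
    row = vitess.execute(
        "SELECT 1 FROM channels_members "
        f"WHERE user_id = ? AND channel_id IN ({placeholders}) LIMIT 1",
        (user_id, *channel_ids),
    ).fetchone()
    return row is not None
```

Starting from the file’s share locations keeps the `IN` list short; starting from the user’s memberships would usually mean carrying a much larger set between the two queries.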
We then verified that the test coverage was sufficient. If it wasn’t, we would write a few additional test cases to verify the results of the original query. Once we were satisfied, we wrapped the new logic in an experiment to enable a gradual rollout and give us the ability to roll back quickly in an emergency. We ran our tests against both implementations, fixed any bugs that cropped up, and repeated the process until we felt confident with our new logic. Finally, we instrumented both calls with timing metrics to track the execution time of both the JOIN and its detangled version. Example 11-5 provides a rough outline for what the file visibility check looked like with both query implementations and corresponding instrumentation.
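The general shape of such a wrapper might look like the sketch below. It is an illustration under stated assumptions rather than the original listing: the metric names, the modulo-based bucketing in `in_experiment`, and the in-memory `TIMINGS` store are all hypothetical stand-ins for a real experiment framework and metrics pipeline.

```python
import time
from collections import defaultdict

# Metric name -> recorded durations in seconds (stand-in for a metrics client).
TIMINGS = defaultdict(list)


def timed(metric_name, fn, *args):
    """Run fn(*args) and record how long it took under metric_name."""
    start = time.perf_counter()
    try:
        return fn(*args)
    finally:
        TIMINGS[metric_name].append(time.perf_counter() - start)


def in_experiment(user_id: int, rollout_percent: int) -> bool:
    # Deterministic bucketing so a given user stays in the same group.
    return user_id % 100 < rollout_percent


def check_visibility(user_id, file_id, rollout_percent, join_impl, split_impl):
    """Route to the split implementation for users in the experiment."""
    if in_experiment(user_id, rollout_percent):
        return timed("file_visibility.split", split_impl, user_id, file_id)
    return timed("file_visibility.join", join_impl, user_id, file_id)
```

Setting `rollout_percent` to 0 sends every caller back to the JOIN implementation, which is the quick rollback path, and the two metric names let the execution times be compared side by side.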
For the riskier query splits (including file visibility), we worked with the quality assurance team to manually verify the change in both our development environments and production before rolling it out to more users. The majority of the JOINs we sought to detangle dealt with critical Slack functionality, so we wanted to be particularly careful that our changes perfectly replicated intended behavior.
We enabled the new implementation for our own internal Slack instance before rolling it out to real customers. This was an important step to confirm that we were properly ingesting timing metrics and further ensure that we had not unintentionally introduced a bug.
Slack’s own workspace has all sorts of quirks, and our usage patterns don’t always match those of our customers. While it often makes for a decent litmus test for catching bugs early, it was not a suitable candidate to help us determine whether the added latencies of the detangled queries were acceptable. For a subset of the JOINs, performance of the detangled queries was particularly poor on our own workspace, but as we continued the rollout to free teams, followed by larger paying customers, the metrics stabilized.
We repeated the process for nearly every JOIN, gingerly slicing queries apart, instrumenting them, and gradually rolling them out to customers. The only exception was two pesky mentions queries, which we left untouched for several months. Unfortunately, these queries posed a number of unique challenges, including JOINs against tables that were undergoing their own Vitess migration. We decided to defer their migration until all their subcomponents had properly fallen into place. Overall, five of us took about six weeks on and off, with our time split between the refactor and other commitments, to finish migrating the majority of the JOINs.
It’s often the case that refactors don’t go entirely according to plan; we encounter hurdles that require us to reshuffle priorities or, in some cases, stop partway through a given step in favor of coming back to it later. Although it feels deeply unsatisfying to hit pause and shift gears, it can sometimes make a huge difference in our ability to deliver the overall project in a timely manner.
For this effort, had we waited for the remainder of the migrations we were depending on to land, it would have set the refactor back by several months. By instead choosing to move forward with the vast majority of the channels_members queries we had successfully rewritten, we were able to continue making headway, uncovering issues as they cropped up; when the time finally came to revisit the mentions queries, we were in a much more stable place to do so.
When we began our migration of channels_members, approximately 15 percent of our total queries per second (QPS) was powered by Vitess. We’d already migrated and resharded critical workloads, such as notifications-related tables, and the teams table responsible for listing each Slack customer instance. We had built reliable techniques and tooling to facilitate nearly 20 migrations, complete with dashboards and a framework for efficiently comparing data sets across the old and new clusters.
The channels_members migration was unique, however, in that it alone accounted for nearly 20 percent of our total query load, nearly doubling the QPS we had learned to manage on Vitess to date. Because of the scale, we were nervous about running into unexpected issues during the migration. That said, we were highly motivated to move these more sizable workloads off of MySQL, because it was struggling under the load of our largest customers. We were stuck between a rock and a hard place.
Our best bet was to lean heavily on the migration tooling we’d built during previous Vitess migrations. We hoped it would be stable enough for this table as well.
The rollout process we had developed for enabling migrations consisted of four high-level modes:
In Backfill mode, we double-wrote queries to both the new cluster (with the new sharding scheme) and the old cluster. This mode further allowed us to backfill our new cluster with existing data from the old cluster.
Dark mode sent read traffic to both clusters and compared the results, logging any discrepancies in the data retrieved from the new Vitess cluster. Consumers of the read traffic were provided with results retrieved from the old cluster.
Light mode sent read traffic to both clusters, again comparing results and logging any discrepancies as they arose. However, instead of returning results from the old cluster, it returned the Vitess results to the application.
During this final stage, we continued to double-write to both clusters but sent read requests strictly to the Vitess cluster. This mode allowed us to discontinue the expensive process of reading from two distinct data sources, all the while enabling any downstream consumers to continue to rely on data stored in the old clusters until they were updated to read from Vitess. (This included systems such as our data warehouse.) At this stage, if any problems were uncovered, the only option was to fix forward; there was no easy or safe way to go back to consuming data from the legacy data source.
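The four modes can be summarized as a small routing sketch. Everything here is an assumption layered on the description above: plain dictionaries stand in for the two clusters, `FINAL` is a placeholder because this passage does not name the last mode, and serving Backfill reads from the old cluster is inferred rather than stated.

```python
from enum import Enum


class Mode(Enum):
    BACKFILL = "backfill"
    DARK = "dark"
    LIGHT = "light"
    FINAL = "final"  # placeholder name; the fourth mode is not named in this passage


# (key, old_result, new_result) tuples, a stand-in for discrepancy logging.
DISCREPANCIES = []


def write(old: dict, new: dict, key, value) -> None:
    # Every mode double-writes to both clusters.
    old[key] = value
    new[key] = value


def read(mode: Mode, old: dict, new: dict, key):
    if mode is Mode.BACKFILL:
        return old.get(key)  # assumed: reads untouched during backfill
    if mode is Mode.FINAL:
        return new.get(key)  # reads go strictly to the new cluster

    # Dark and Light both read from both clusters and compare.
    old_result, new_result = old.get(key), new.get(key)
    if old_result != new_result:
        DISCREPANCIES.append((key, old_result, new_result))
    # Dark serves the old cluster's answer; Light serves the new one.
    return old_result if mode is Mode.DARK else new_result
```

The only difference between Dark and Light is which result the caller receives, which is what makes Dark a low-risk way to validate the new cluster’s data before anything depends on it.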
Fast, simple configuration deployments enabled us to swap easily between modes as well as ramp up and down within a single mode. The system also provided us with rather granular controls, whereby we could swiftly opt tiers of customers and users into distinct modes. We took advantage of being able to tweak these settings to ramp up and back down rapidly when we encountered any issues.
Every migration began with Backfill mode. In this mode, there were two primary goals. The first goal was to set the stage for running a complete backfill of the data from the old cluster in preparation for the migration of read queries. For the majority of our previous migrations, this phase was quite simple; the write queries for the new cluster would be identical (or nearly so) to the corresponding write queries to the old cluster. Because we were actively changing the data model, we ended up having to rewrite many of our application’s SQL queries to conform to our new schema (including propagating the share_type correctly and translating local user IDs to their canonical counterparts). Luckily, thanks to the prior consolidation discussed in Chapter 10, we were able to readily identify each query requiring a rewrite.
第二个目标是揭示与新集群的写入负载相关的任何性能问题。对于大多数此类迁移,我们认为Backfill和Dark模式对生产中的应用程序的性能影响相对较小(如果有的话)。这主要是因为:
The second goal was to unveil any performance problems associated with write load to the new cluster. For most of these migrations, we considered the Backfill and Dark modes to have relatively little (if any) performance impact on the application in production. This was primarily because:
我们利用 Hacklang 的async合作社多任务模式同时向两个集群发送查询。我们在 Vitess 中对到达新集群的查询设置了一个短暂的一秒(1s)超时,因此在最坏的情况下,这些查询的性能损失将是 1s 减去从旧集群执行查询所花费的时间。
We used Hacklang’s async cooperative multitasking mode to send queries to both clusters concurrently. We set a short, one-second (1s) time-out on the query hitting the new cluster in Vitess, so that in the very worst case, the performance penalty for these queries would be 1s minus the time it took to execute the query from the old cluster.
我们尚未将结果从 Vitess 集群返回给应用程序!这将在Light模式下发生。
We were not yet returning the results to the application from the Vitess cluster! This would occur in Light mode.
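The first of these two points can be sketched as follows. The original code used Hacklang's async support; this example uses Python's `asyncio` instead, and the function and parameter names (`double_write`, `write_legacy`, `write_vitess`) are hypothetical. Both writes start at the same moment, so the extra time a caller can wait on the new cluster is bounded by the time-out minus the time the legacy write took.

```python
import asyncio


async def double_write(write_legacy, write_vitess, timeout_s=1.0):
    """Send a write to both clusters concurrently.

    Only the legacy result is returned to the caller; the new-cluster write
    is capped at timeout_s, measured from the moment both writes start.
    """
    vitess_task = asyncio.create_task(
        asyncio.wait_for(write_vitess(), timeout=timeout_s)
    )
    legacy_result = await write_legacy()
    try:
        await vitess_task
        vitess_ok = True
    except Exception:
        # A time-out or error on the new cluster never fails the request.
        vitess_ok = False
    return legacy_result, vitess_ok
```

A real system would presumably also record the failures for later investigation rather than reduce them to a boolean.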
这次迁移再次证明我们的假设是错误的。我们要迁移到的用户分片 Vitess 数据库集群channels_members已经填充了使用率很高的生产数据(包括已保存的消息和通知)。随着我们逐步启用Backfill模式,Vitess 上的数据库资源开始饱和,导致对集群上已驻留的关键表的查询超时和错误。深入研究后,我们发现有许多更新和删除查询缺少分片键(user_id),因此分散在集群中的每个分片上。我们进行了配置更改,以便这些查询可以更高效地运行,然后试探性地启动了第二次逐步启用Backfill模式。我们很快就达到了 100% 并开始下一阶段,黑暗模式!
Again, our assumptions proved wrong with this migration. The user-sharded Vitess database cluster to which we were moving channels_members was already populated with highly used production data (including saved messages and notifications). As we ramped up Backfill mode, we began saturating database resources on Vitess, leading to time-outs and errors for queries to the critical tables already residing on the cluster. Digging in, we discovered that we had a number of update and delete queries lacking our sharding key (user_id), thus scattering them across every shard in the cluster. We made a configuration change so that these could run more efficiently, and then tentatively kicked off a second gradual ramp-up of Backfill mode. We quickly reached 100 percent and began the next stage, Dark mode!
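To see why a missing sharding key is so costly, consider this minimal sketch of query routing. The hash function and the shard count are assumptions chosen for illustration and do not reflect how Vitess actually maps keys to shards; the point is only that a query with `user_id` in its conditions touches one shard, while a query without it must be sent to all of them.

```python
import hashlib

NUM_SHARDS = 256  # hypothetical shard count


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Pick the single shard that owns a given user's rows."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def target_shards(where: dict, sharding_key: str = "user_id",
                  num_shards: int = NUM_SHARDS) -> list:
    """Return the shards a query with these WHERE conditions must visit."""
    if sharding_key in where:
        return [shard_for(where[sharding_key], num_shards)]
    # No sharding key: the query scatters to every shard in the cluster.
    return list(range(num_shards))
```

Under these assumptions, an update filtered only by a channel ID does 256 times the work of the same update filtered by `user_id` as well.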
我们认真地进入了重构的暗黑模式部分,仔细重写了大部分channels_members查询(包括许多麻烦的 JOIN)以从 Vitess 读取,并在短短三个多月内成功完成了回填过程。由于我们的迁移系统使我们能够将查询子集选择到不同的阶段(即,一个查询可以处于暗黑模式,而另一个查询处于浅色模式),为了尽可能多地并行重构,我们在重写所有查询以正确从 Vitess 集群读取之前就开始增加暗黑模式。
We entered the Dark mode portion of the refactor in earnest, having carefully rewritten most of the channels_members queries (including many of the troublesome JOINs) to read from Vitess, and successfully completed the backfill process in just over three months. Because our migration system enabled us to opt subsets of queries into different phases (i.e., one query could be in Dark mode while another was in Light mode), in an effort to parallelize as much of the refactor as possible, we began to ramp up Dark mode before we’d rewritten all our queries to read from the Vitess cluster properly.
与Backfill模式一样,暗黑模式有两个主要目标。同样,我们的目标之一是揭示与发送到新集群的读取流量相关的任何潜在性能问题。
Dark mode, as with Backfill mode, had two primary goals. Once again, one of our objectives was to reveal any potential performance problems associated with the read traffic being sent to the new cluster.
当我们开始增加流量以与旧系统同时从 Vitess 读取数据时,我们注意到少数具有高 QPS 的查询返回了惊人的行数。高 QPS 与大量返回行的组合使得每秒返回的总行数成为我们集群中最大的。图 11-3显示,在峰值时,我们从单个分片的channels_members表每秒返回大约 9,000 行。事实上,这些查询非常频繁且占用大量内存,以至于它们导致内存不足错误 (OOM) 淹没数据库主机本身!在我们增加流量后的几天里,我们每天都看到 1/256 的主机内存耗尽。
As we began ramping up traffic to read from Vitess concurrently with our legacy system, we noticed that a handful of queries with high QPS returned an alarming number of rows. The combination of high QPS with the large number of rows returned made the overall rows returned per second the largest in our cluster. Figure 11-3 shows that at peak, we were returning about 9,000 rows per second from a single shard’s channels_members table. In fact, these queries were so frequent and memory-intensive that they caused out-of-memory errors (OOMs) to flood the database host itself! During the days following our ramp-up, we saw 1/256 of our hosts running out of memory every day.
起初,我们认为是云提供商出了问题;也许是我们配置最大数据库集群的方式出了问题。最后,我们意识到这不是配置失误或偶然的运气不好,于是我们迅速采取措施,开始隔离 OOM 的来源。
At first, we believed that our cloud provider was at fault; perhaps something was wrong with the way we had provisioned our largest database cluster. Eventually, we realized that it wasn’t a configuration mishap or random bad luck, and we swiftly ramped down to start isolating the source of the OOMs.
图 11-4显示了我们在每周状态更新期间对 OOM 感到意外的情况。
Figure 11-4 shows our surprise with the OOMs during our weekly status update.
重构是基础设施内部和整个工程组织中的高优先级项目。从数据库可靠性的角度来看,迁移channels_members到 Vitess 是继续开发围绕新系统操作的肌肉记忆的重要一步,因此当 OOM 被证明特别难以捉摸时,我们开始与 Slack 的整个数据库团队合作,直接在我们为协调工作而设置的频道 #feat-vitess-channels 中从各个角度进行调试。我们尝试调整 MySQL 进程的内存分配大小,深入研究 MySQL 和操作系统级别的内存碎片和分配。在此过程中,我们升级了 MySQL 的次要版本,以访问允许我们为缓冲池指定非均匀内存访问 (NUMA) 交错策略的新设置!同时,我们继续拆分更多JOINs,并开始增加更多暗模式查询负载。每次,我们都以为我们可能会停止遇到 OOM,但令人失望的是,随着我们增加更多负载,我们不断遇到它们。
The refactor was a high-priority project both within infrastructure and across the engineering organization at large. From a database reliability perspective, moving channels_members to Vitess was an important step in continuing to develop our muscle memory around operating the new system, so when the OOMs proved particularly elusive, we began working with the entire database team at Slack, debugging from all angles directly in the channel we had set up to coordinate the effort, #feat-vitess-channels. We attempted to resize the memory allocation for our MySQL processes, digging into memory fragmentation and allocation at both the MySQL and operating system levels. During this process, we upgraded minor versions of MySQL to gain access to a new setting that allowed us to specify the nonuniform memory access (NUMA) interleave policy for the buffer pool. Meanwhile, we continued to split up more JOINs and began ramping up more Dark mode query load. Each time, we thought we might stop encountering OOMs, only to be disappointed when they reappeared as we ramped up more load.
此时,项目刚刚超过六个月,超出了我们最初的估计;整个团队都感觉我们一直在前进两步,后退一步。经过数周的反复试验,我们发现 Slack 的其他存储系统(包括我们的监控集群和搜索集群)遇到了限制性较小的值问题min_free_kbytes,这是一个低级内核设置,负责控制内核决定释放内存的积极程度。值越大,内核通过释放更多 RAM 中保存的数据给自己留出的喘息空间就越大。由于大量查询以高 QPS 返回大量行,我们会偶尔遇到需要突然分配大量 RAM 的请求高峰,从而导致 OOM,因为内核无法足够快地释放 RAM 以返回结果。将其提高min_free_kbytes到更高的值使我们的主机能够更好地管理与这些查询相关的内存压力,并最终解决了我们的 OOM。
At this point, the project had just surpassed the six-month mark, obliterating our initial estimate; the whole team very much felt as though we were consistently taking two steps forward and one step back. After weeks of trial and error, we discovered that other storage systems at Slack (including our monitoring cluster and Search cluster) had hit problems with a restrictively small value for min_free_kbytes, a low-level kernel setting responsible for controlling how aggressively the kernel decides to free memory. The larger the value, the more breathing room the kernel will give itself by shedding more data held in RAM. With the substantial number of queries returning a large number of rows at high QPS, we would sporadically hit spikes of requests that required a sudden allocation of a large amount of RAM, leading to OOMs, because the kernel couldn’t free RAM quickly enough to return results. Bumping min_free_kbytes to a higher value enabled our hosts to manage the memory pressure associated with these queries better and finally resolved our OOMs.
我们在黑暗模式阶段花了整整八个月的时间;我们在这个阶段花费的时间不仅比我们最初预计在整个项目上花费的时间要多,而且在我们完成后,它占了整个项目的近三分之二。发生了什么?
We spent eight whole months in the Dark mode phase; not only did we spend more time in this phase alone than we had initially anticipated spending on the project as a whole, it accounted for nearly two-thirds of the entire endeavor once we’d completed it. What happened?
考虑到我们的配置变化,我们可以轻松地将 100% 的流量增加到 Vitess 集群,而不会影响整个站点的性能。此时,几乎所有的 JOIN 都已解开,所有点查询也都已更新为从 Vitess 集群读取。在第二步中,我们的主要目标是揭示新查询返回的数据集中的任何差异。我们可以轻松地并排比较这两个集合,因为我们同时针对新旧集群运行查询,并在遇到差异时记录它们(使用现有查询对旧数据源的结果作为事实来源)。我们通过多种方式汇总了差异,以便我们可以大致了解我们需要解决的问题的范围,此外,每当一对返回不同的结果时,我们都会记录主键。
Given our configuration changes, we were comfortable ramping up 100 percent of the traffic to the Vitess cluster without the risk of affecting site-wide performance. At this point, nearly all JOINs were detangled, with all point queries updated to read from the Vitess cluster as well. During this second step, our primary goal was to reveal any discrepancies in the data sets returned from the new queries. We could easily compare the two sets side by side because we concurrently ran our queries against both the new and old clusters and logged diffs as we encountered them (using results from the existing query against our legacy data source as the source of truth). We aggregated discrepancies in a number of ways so that we could get a broad sense of the scope of the problems we needed to address, in addition to logging primary keys whenever a pair returned different results.
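A minimal sketch of this kind of comparison is shown below. It assumes rows are dictionaries, treats the legacy result as the source of truth, and compares only an explicit list of columns. The function names, the diff categories, and the `channel_id` column are hypothetical; `user_id` and `last_read` come from the text.

```python
from collections import Counter


def diff_rows(legacy_rows, vitess_rows, primary_key, columns):
    """Compare two result sets, returning (primary key, kind of diff) pairs."""
    def index(rows):
        return {tuple(row[k] for k in primary_key): row for row in rows}

    legacy, vitess = index(legacy_rows), index(vitess_rows)
    diffs = []
    for pk in sorted(legacy.keys() | vitess.keys()):
        old, new = legacy.get(pk), vitess.get(pk)
        if old is None:
            diffs.append((pk, "extra_in_vitess"))
        elif new is None:
            diffs.append((pk, "missing_in_vitess"))
        elif any(old[c] != new[c] for c in columns):
            diffs.append((pk, "value_mismatch"))
    return diffs


def summarize(diffs):
    """Aggregate diffs by kind to gauge the scope of each problem."""
    return Counter(kind for _, kind in diffs)
```

Logging the primary key with each diff makes it possible to look up the exact rows later, while the aggregate counts show which class of problem is worth fixing first.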
我们在这个阶段花了几个星期,仔细梳理了差异。由于我们的用户分片模式比原始的工作区分片channels_members表包含更多信息,因此我们在重写过程中处理的变量比我们原本可能处理的变量要多得多。我们试图改善使用共享通道和企业网格的工程师的开发体验,这要求我们在迁移每个查询时仔细考虑棘手的产品逻辑。这意味着出错的可能性比我们进行一对一迁移要大得多(到目前为止,我们移动到 Vitess 的每个表都是这种情况)。
We spent a few weeks in this phase, meticulously combing through the diffs. Because our user-sharded schema incorporated more information than the original, workspace-sharded channels_members table, we were juggling many more variables during the rewrite process than we might have otherwise. We sought to improve the developer experience for engineers working with shared channels and Enterprise Grid, requiring us to consider tricky product logic thoughtfully with each query we migrated. This meant that the potential for mistakes was much greater than had we done a one-to-one migration (as was the case with every table we had moved to Vitess to date).
数据集中的大部分差异都是由于单个问题造成的;修复单个实例通常会导致记录的差异量大幅减少。例如,如果我们在旧系统上选择的列集与 Vitess 上的列集不同,则每个查询都会返回不匹配的结果,并记录差异。正如我们在图 11-5中所报告的那样,查找和修复差异以忽略不匹配的列会将针对通道分片表记录的差异数量从所有查询的 10% 减少到仅 0.01%。
Large portions of the differences in the data sets were due to single problems; fixing a single instance would often lead to a large reduction in the volume of diffs logged. For example, if on the legacy system we were selecting a different set of columns than on Vitess, every query would return mismatched results, logging a diff. As we reported in Figure 11-5, updating the comparison to ignore mismatched columns decreased the number of diffs logged against the channel-sharded table from 10 percent of all queries to just 0.01 percent.
以下是图 11-5 中 Slack 消息图表的特写(channels_members 表上的差异):
Here’s a close-up of the graph in Figure 11-5’s Slack message:
可惜的是,并非所有的差异都如此容易修复。通过查看数据集中的差异,我们发现了一些共享通道逻辑不太正确的地方,还有一些我们在补全过程中犯了错误的地方。这是一项繁琐的工作,而且由于涉及产品,通常需要对应用程序的内部工作原理有深刻的理解。尽管我们的操作隐藏在功能标记和实验之间,但我们所做的更改对我们的生产系统产生了真正的影响,因此我们必须非常谨慎地进行。考虑到这些因素以及项目进展缓慢的事实,我们要求产品工程部门提供更多资源。
Alas, not all diffs were as easy to fix. Reading through the differences in data sets, we uncovered a few spots where our logic for shared channels was not quite right, and a few others where we had made mistakes in our backfill. It was tedious work and, due to the product implications, oftentimes required a profound understanding of the inner workings of our application. Although our manipulations were hidden behind feature flags and experiments, the changes we were making had real ramifications for our production systems, and we had to proceed with great caution. Given these factors and the fact that the project was progressing at a slow crawl, we asked for more resources from product engineering.
新人的加入为项目带来了新的活力。我们这些已经参与了好几个月的人都渴望从新的角度看待我们面临的许多问题。我们使用结对来快速培养新工程师,齐心协力调试一小部分数据差异。这是一个完美的环境,可以展示 Vitess 迁移工具和分阶段推出的过程,并讨论新的模式。这项工作很繁琐,但有了更多的工程师,我们设法大大提升了我们的势头,消除了最后的几个差异。我们没有达到零差异,但对 99.999% 的正确率感到满意。因为我们知道,当用户在 Slack 中阅读消息并移动 last_read 光标状态时,channels_members 的每一行都可能非常快速地发生变化,所以我们对一定程度的差异感到满意,这些差异可能归因于快速的写后读情况。深入研究剩余的 0.001% 的差异,当我们在发生差异后直接在数据库中检查行时,我们注意到行会收敛到相同的状态。
Bringing new folks onboard brought new life to the project. Those of us who had been involved for many months were eager to get new perspectives on the many problems we’d been facing. We used pairing to ramp up new engineers quickly, joining forces to debug a small set of data discrepancies. It was the perfect context from which to demonstrate the Vitess migration tooling and the phased rollout process, and to talk through the new schemas. The work was tedious, but with a bigger arsenal of engineers at our disposal, we managed to boost our momentum drastically and banish the final few discrepancies. We did not get down to zero diffs, but settled for feeling good at 99.999 percent correctness. Since we knew that each channels_members row could change quite rapidly as a user read messages in Slack, moving their last_read cursor state, we felt comfortable with some amount of discrepancy that could be attributed to rapid read-after-write situations. Digging into the remaining 0.001 percent of differences, when we examined rows directly in the database after a diff occurred, we noticed that the rows would converge to the same state.
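The convergence check described here was done by inspecting rows by hand, but the idea can be sketched in code. In this hypothetical helper, `fetch_legacy` and `fetch_vitess` are assumed callables that return the current row for a primary key from each data source; a diff counts as transient if the two rows match on any re-read.

```python
import time


def recheck(pk, fetch_legacy, fetch_vitess, attempts=3, delay_s=0.0):
    """Re-read a row that produced a diff and classify the discrepancy."""
    for _ in range(attempts):
        if fetch_legacy(pk) == fetch_vitess(pk):
            # The rows agree now, so the diff was a read-after-write race.
            return "converged"
        time.sleep(delay_s)
    # Still different after every attempt: a real discrepancy to investigate.
    return "persistent"
```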
完成Dark阶段意义重大。确保 100% 的channels_members流量可以在 Vitess 上以高性能方式运行并返回正确结果对于重构的整体成功至关重要。虽然我们还没有完全完成,但能够结束Dark模式让每个人都松了一口气。最后,我们准备为公司内一小部分测试用户升级到Light模式。
Wrapping up the Dark phase was significant. Knowing that 100 percent of channels_members traffic could run in a performant way on Vitess and return correct results was absolutely crucial to the overall success of the refactor. Although we weren’t quite finished yet, being able to close the book on Dark mode was a relief to everyone. Finally, we were ready to ramp up to Light mode for a small subset of beta users within the company.
在轻量模式下,我们想测试从 Vitess 集群执行查询中检索到的数据,以确保将流量交换到新表不会引入任何面向用户的回归。我们相当有信心,错误会相对较少,这在很大程度上是因为在前几个阶段已经完成了解决数据差异的工作。然而,由于频道成员资格是 Slack 的核心,如果有任何错误,它们的风险都相当严重。因此,我们谨慎地开始了轻量模式的升级,从 Slack 的一小群志愿者开始,最终目标是将其推广到我们的整个客户群。
During Light mode, we wanted to test-drive the data retrieved from executing queries against the Vitess cluster, certifying that swapping over traffic to our new tables would not introduce any user-facing regressions. We were fairly confident that there would be relatively few bugs, in great part because of the work completed during the previous phases to address data discrepancies. However, because channel membership is at the core of Slack, if there were any bugs at all, they risked being quite serious. So we began our Light mode ramp-up carefully, starting with a small group of volunteers at Slack, with the eventual goal of enabling it for our entire customer base.
大多数情况下,一切都运行正常,但我们很快就遇到了一个问题:有时,在加入频道后,用户无法发送消息。我们立即停止实验,并深入研究查询日志,我们将这些日志保存在所有数据库主机上长达两个小时。这些日志使我们能够轻松调试,查找对给定频道中用户成员资格行的任何修改以及负责这些修改的调用者。
Most things worked fine, but we quickly ran into a problem when sometimes, after joining a channel, users would be unable to send messages. We immediately ramped down the experiment and dug into query logs, which we kept on all database hosts for up to two hours. These logs allowed us to debug easily, grepping for any modifications to the user’s membership row in the given channel and the callers responsible for them.
我们很快就找到了罪魁祸首:一个后台进程,当任何 Grid 用户加入他们之前所属的工作区级频道后,该进程就会触发,它会查找具有规范用户 ID 的成员资格行并将其替换为用户的本地用户 ID。这是一个问题,因为我们在 Vitess 中的新数据库模式有意使用了规范用户 ID;在该进程重写了用户 ID 后,我们无法再找到用户的会员行,从而阻止他们发送 消息。
We quickly identified the culprit: a background process, triggered after any Grid user joined a workspace-level channel they’d previously been a member of, which would locate and replace membership rows that had a canonical user ID with the user’s local user ID. This was a problem because our new database schema in Vitess intentionally used canonical user IDs; after the process had rewritten the user ID, we could no longer locate the user’s membership row, thereby preventing them from sending messages.
我们很困惑为什么会出现这种流程,也很想知道我们是否需要保留这种奇怪的行为,或者发现了一个更严重的问题。查看几年前的 Slack 对话和 git 历史记录后发现,代码是为了解决 Enterprise Grid 功能特有的问题而编写的,我们有时会使用规范用户 ID 编写占位符成员资格行,并在用户重新加入这些频道后更新它们。
We were puzzled about why this process existed and curious to understand whether we needed to preserve this strange behavior or had uncovered a more nefarious problem. A journey into Slack conversations and git history from years prior revealed that the code was written to paper over a problem specific to an Enterprise Grid feature, where we sometimes wrote placeholder membership rows with canonical user IDs and updated them once users rejoined those channels.
这个问题并没有出现在我们在暗黑模式阶段检查到的差异中,也没有出现在几轮手动质量保证 (QA) 和我们编写的单元测试中,因为它只在精确、极不常见的情况下出现。幸运的是,我们确定不再需要这个过程,并将其完全删除。问题解决了!
This issue did not manifest itself in the discrepancies we inspected during the Dark mode phase, nor did it appear during several rounds of manual quality assurance (QA) and in the unit tests we wrote, because it only arose under precise, highly uncommon circumstances. Fortunately, we determined that we no longer needed this process and deleted it entirely. Problem solved!
从开始到结束,我们花了一个月的时间向所有客户推广浅色模式。在我们通过一小群志愿者对 Vitess 集群中数据的整体正确性有了信心之后,我们继续推广。我们从自己的 Slack 实例开始,然后是免费层的团队,然后是付费客户,最后是我们最大的企业客户。在推广过程中,我们注意到共享频道数量最多的客户在查看频道时调用的 API 超时了 ( conversations.view)。我们很快注意到 API 调用期间执行的一个 Vitesschannels_members查询超时了。不幸的是,由于查询量相对较低,我们在深色模式阶段没有收到有关该问题的警报。我们立即为客户回滚了浅色模式,修复了查询,然后重新开始推广。
From start to finish, we spent one month ramping up Light mode to all customers. Once we’d gained confidence in the overall correctness of the data in the Vitess cluster with our small set of volunteers, we continued the ramp-up. We began with our own Slack instance and then went on to teams on the free tier, followed by paying customers, and finally our largest Enterprise customers. During the ramp-up, we noticed that our customer with the greatest number of shared channels was seeing time-outs on the API called when viewing a channel (conversations.view). We quickly noticed that one of the Vitess channels_members queries executed during the API call was timing out. Unfortunately, because the query was relatively low volume, we hadn’t been alerted to the problem during the Dark mode phase. We immediately rolled back Light mode for the customer, fixed the query, and ramped right back up.
在成功让所有客户选择轻量模式仅仅三天后,我们就开始进入最后阶段,即日落模式。在此阶段,尽管我们继续对两个数据源进行双重写入,但我们只将读取流量路由到新的 Vitess 集群。通过为用户启用日落模式,我们将超载遗留系统的查询负载减少了 22%,为他们提供了急需的喘息空间。图 11-6显示了我们在工作区分片中观察到的查询量下降。
A mere three days after successfully opting all customers into Light mode, we began the final stage, Sunset mode. During this phase, although we continued to double-write to both data sources, we only routed read traffic to the new Vitess clusters. By enabling Sunset mode for our users, we decreased the query load on our overloaded legacy systems by 22 percent, giving them much-needed breathing room. Figure 11-6 shows the dip in query volume we observed across our workspace shards.
日落模式结束后,还有一些重要任务需要完成。也就是说,一旦我们的数据仓库依赖项正确迁移到使用来自 Vitess 的渠道成员资格数据,我们就需要删除旧的工作区分片channels_members表。大约一个月后,我们与它们告别。然后,我们花了接下来的几周时间整理渠道成员资格统一数据库,仔细解除所有功能标记并删除双重写入逻辑。
After Sunset mode, a handful of important tasks remained. Namely, once our data warehouse dependencies had been properly migrated to consume channel membership data from Vitess, we needed to drop the old workspace-sharded channels_members tables. We bade them farewell roughly a month later. We then spent the following weeks tidying the channel membership unidata library, carefully unwinding any feature flags and removing double-writing logic.
删除旧分片的写入操作是一个巨大的及时胜利。我们删除了 50% 的写入操作,并完全消除了我们最大客户(第 10 章中的 VLB )的企业分片上的复制延迟,而就在它开始在持续不断的写入流量压力下挣扎时。在删除表之前的几天里,该分片的复制延迟一直超过 20 分钟。图 11-7显示了 VLB 企业分片的写入流量急剧下降。
Dropping writes from the legacy shards was a huge, timely win. We removed 50 percent of writes and completely eliminated replication lag on the enterprise shard for our largest customer (VLB from Chapter 10), just as it was beginning to struggle under the pressure of the incessant write traffic. In the days leading up to dropping the table, the shard had been experiencing replication lag upward of 20 minutes. Figure 11-7 shows the steep drop in write traffic to VLB’s enterprise shard.
图 11-8表明,删除写入负载后,复制滞后明显没有出现峰值。
Figure 11-8 shows a distinct lack of spikes in replication lag following the removal of the write load.
以下是图 11-8中的 Slack 消息图表的特写:
Here’s a close-up of the graph in Figure 11-8’s Slack message:
不幸的是,就在我们即将完成时,冠状病毒开始蔓延,我们在世界各地的办公室关闭,Slack 的全体员工都转为在家办公。随着全球转向远程办公,Slack 的需求急剧增加;我们以惊人的速度吸引新客户,现有客户发送的消息比以往任何时候都多。整个基础设施团队,包括我们这些即将完成 channels_members 迁移的人,都紧急将重点转移到将我们的系统扩展到前所未有的水平。虽然我们很高兴能够完成重构,但我们从未得到适当的机会来庆祝我们的成就。
Unfortunately, just as we were finishing up, the coronavirus was beginning to spread, and our offices around the world shut down, with Slack’s entire workforce transitioning to working from home. With the global shift to remote work, Slack saw a sharp increase in demand; we were acquiring new customers at a breakneck pace, and our existing customers were sending more messages than ever before. The entire infrastructure team, including those of us winding down the channels_members migration, urgently shifted its focus to scaling our systems to unprecedented levels. Although we were relieved to bring the refactor to a close, we were never given the proper opportunity to revel in our achievement.
随着这个项目的结束,Slack 的其他工程师开始策划如何利用新分片的表。很快,新功能的原型就开始出现,即使我们处于 SUNSET 模式,许多后续项目也迅速在多个团队中配备人员,以利用新的数据模型并简化网格和共享通道周围的其他查询。
With this project at a close, other engineers at Slack started scheming about ways to take advantage of the newly resharded table. Prototypes of new features started emerging even while we were still in Sunset mode, and several follow-up projects were quickly staffed across multiple teams to take advantage of the new data model and simplify other queries around both Grid and shared channels.
与我们之前的案例研究一样,从迁移到 Vitess 的过程中,我们可以学到许多重要的经验教训channels_members。我们将从项目可能进展得更好的方式开始,描述我们如何设定更切合实际的估算并尽早找到合适的队友。然后,我们将讨论成功的方法,详细说明我们在开始时谨慎扩大项目范围的决定以及我们简单沟通策略的优点。
As with our previous case study, there are a number of important lessons to be learned from our migration of channels_members to Vitess. We’ll start with ways the project might have gone better, describing how we might have set more realistic estimates and sourced the right teammates sooner. Then we’ll discuss ways it succeeded, detailing our decision to increase project scope carefully at the outset and the merits of our simple communication strategy.
当我们开始将表迁移channels_members到 Vitess 时,我们已经完成了多次 Vitess 迁移。我们构建并改进了工具来改进流程,使每次迭代都更加轻松和安全。我们根据最近几次迁移的经验进行了初步估算,这些迁移的速度明显比前几次要快。我们乐观地认为,这次迁移不会比上一次更困难。
By the time we started our migration of the channels_members table to Vitess, we had done a number of Vitess migrations already. We had built and refined tooling to improve the process, making it easier and safer with every iteration. We based our initial estimates on our experience with our most recent migrations, which had been decidedly quicker than the first few. We optimistically assumed that this migration would be no more difficult than the last.
然而,我们本应知道,channels_members由于多种原因,这将是一个不同的难题。首先,查询负载远远超过了我们之前的任何迁移。其次,我们决定将数据分片到两个键(用户和渠道),而不是一个。最后,我们选择使用规范的用户 ID 并对架构进行有意义的更改,以提高开发人员的工作效率,从而进一步增加了项目的复杂性。我们的估算应该反映出这些重要的决定及其影响。
We should have known, however, that channels_members would be a different beast for a number of reasons. First, the query load far exceeded any of our previous migrations. Second, we decided to shard the data across two keys, user and channel, rather than just one. Finally, we chose to use canonical user IDs and make meaningful changes to the schema to improve developer productivity, thereby further increasing the complexity of the project. Our estimates should have reflected these important decisions and their implications.
当我们超出最初预期时,团队士气大受打击,工程领导层对项目更加谨慎。幸运的是,我们能够获得更多资源并推进重构,但我们的估计显然没有达到最初的预期。
The team took a morale hit when we surpassed our original estimate, and engineering leadership turned a more watchful eye on the project. Fortunately, we were able to secure more resources and move forward with the refactor, but our estimate certainly did not set the expectations it should have at the start.
设定不切实际的估计可能会带来更严重的后果:重构可能会失去优先权,工程领导层可能会对你推动大型软件项目的能力失去信心。你的职业生涯可能会受到打击。如果我们花时间集思广益,找出每个潜在的陷阱,并依靠第 4 章中讨论的策略,我们可能会在重构开始时为自己和利益相关者设定更好的期望。
Setting unrealistic estimates can have much more serious consequences: the refactor might lose priority, and engineering leadership might lose faith in your ability to drive large software projects. Your career risks taking a hit. Had we taken the time to brainstorm each of the potential pitfalls and leaned on the strategies discussed in Chapter 4, we might have set better expectations for both ourselves and our stakeholders at the start of the refactor.
当我们开始这个项目时,我们假设大部分工作最好由基础设施工程师来处理。我们可以根据需要联系产品工程师,提出问题或临时寻求代码审查。只有在我们遇到解开 JOIN 的困难时,我们才要求产品工程提供更重要的资源。正是在那时,我们意识到,与熟悉我们正在迁移的查询的工程师一起工作可以更快地完成工作。他们的参与在漫长的黑暗模式阶段至关重要,在此期间,我们调试了许多导致产品出现奇怪行为的数据差异。如果他们从一开始就更多地参与其中,我们可能会更快、更正确地迁移查询(包括 JOIN),从而减少后期阶段所花费的时间。
When we started the project, we assumed that the majority of the work would be best handled by infrastructure engineers. We could reach out to product engineers as necessary, asking questions or seeking code review on an ad hoc basis. Only once we ran into difficulties detangling the JOINs did we ask for more significant resourcing from product engineering. It was at that point that we realized that we could work faster by working alongside engineers who were intimately familiar with the queries we were migrating. Their involvement was crucial throughout the lengthy Dark mode phase, during which we debugged a number of data discrepancies that led to strange behaviors in the product. Had they been more present from the beginning, we might have migrated queries more quickly and more correctly (including the JOINs), cutting down on the time spent in later phases.
正如第 5 章所讨论的,有时你的队友并不是最适合这项工作的人。由于大规模重构影响深远,因此它们通常涉及来自不同团队和学科的工程师。你在项目开始时确定的团队很少是固定的。如果你认为你的团队不再合适,找出缺少谁并寻找这些人。如果你认为你需要的资源比你最初预期的要多,那就向他们索要。
As discussed in Chapter 5, sometimes the teammates you have are not the ones best suited for the job. Because large-scale refactors have far-reaching impact, they often involve engineers from different teams and disciplines. The team you identify at the start of your project is very rarely set in stone. If you believe your team is no longer the right one, figure out who it is missing and seek out those individuals. If you think you need more resources than you had initially anticipated, ask for them.
我们在重构初期做出的一个重要决定是,对 Vitesschannels_members模式中所有与用户 ID 相关的列使用规范用户 ID。我们知道 Slack 的目标是始终采用规范用户 ID,但在我们的表迁移完成之前,项目的前几个阶段不太可能结束。
An important decision we made early in the refactor was to use canonical user IDs for all user ID–related columns in the Vitess channels_members schemas. We knew that Slack was aiming to adopt canonical user IDs throughout, but the first few phases of the project were unlikely to conclude before our table migration was complete.
通过选择采用规范的用户 ID,我们有意扩大了重构的范围。我们本可以先花时间在旧式工作区分片集群上规范化用户 ID,等到数据正确更新后再迁移到 Vitess。同样,我们可以在不规范化 ID 的情况下迁移表,并在 ID 安全进入 Vitess 后启动该过程。我们相信,通过同时进行这两项工作,我们可以节省时间和精力。(虽然我们没有很好的方法来衡量这一点,但我们确实相信这是真的!)
By choosing to adopt canonical user IDs, we intentionally increased the scope of the refactor. We could have spent the time canonicalizing user IDs on our legacy workspace-sharded clusters first, only migrating to Vitess once the data had been properly updated. Likewise, we could have migrated the table without canonicalizing the IDs and initiated the process once it had safely landed in Vitess. We believed that by doing both at the same time, we would save both time and effort. (While we had no great way of measuring this, we do believe it turned out to be true!)
在第 4 章中,我们了解到保持适度的范围很重要,以确保重构在合理的时间内完成,并且不会影响到不必要的范围。但是,在某些情况下,增加一些额外的范围是值得的,最终会使工作更加成功。在项目规划阶段要注意这些机会,并在项目全面展开之前做出明智的决定,充分利用它们。这样,当你更广泛地传达你的计划时,利益相关者将有机会对额外的范围发表意见,并且每个人的期望都应该得到适当的调整。
In Chapter 4, we learned that keeping a moderate scope is important to ensure that a refactor is completed within a reasonable amount of time and does not affect more surface area than is necessary. However, there are circumstances when adding some additional scope is worthwhile and will ultimately make the effort more successful. Be mindful of these opportunities during the project planning stage and make a deliberate decision to take advantage of them well before the project is in full swing. This way, when you communicate your plan more broadly, stakeholders will have an opportunity to voice an opinion about the additional scope, and everyone’s expectations should be appropriately aligned.
在整个重构过程中,我们严重依赖我们的项目频道 #feat-vitess-channels 进行协作、协调并提供重要更新。由于它是我们的中心联络点,每个人都能及时了解新消息。这是一个提问或发布代码以供审查的好地方;你肯定会在几分钟内得到回复。有几次,队友会在线程中调试问题,以便其他人加入或稍后跟进。在重构的Light模式部分,自愿选择加入新查询的用户会来到 #feat-vitess-channels 报告他们遇到的错误和其他奇怪行为。如果它与迁移到 Vitess 有关channels_members,你可以在这个频道中找到它。
Throughout the refactor, we leaned heavily on our project channel, #feat-vitess-channels, to collaborate, coordinate, and provide important updates. Because it served as our central point of contact, everyone kept up to date with new messages. It was a great place to ask questions or post code for review; you were sure to get a response within a few minutes. On several occasions, teammates would debug issues in threads for others to chime in or catch up on later. During the Light mode portion of the refactor, users who had volunteered to be opted in to the new queries would come to #feat-vitess-channels to report bugs and other strange behavior they’d encountered. If it was related to moving channels_members to Vitess, you could find it in this channel.
最重要的是,#feat-vitess-channels 是我们互相激励的地方。随着重构的拖延,工程师们不断切换,黑暗模式不断给我们带来许多难题,我们越来越难以对进展保持乐观。公司各地的工程师偶尔会来这里鼓励大家说“你做到了!”,或者对每周的状态更新做出一系列表情符号反应。小而周到的支持行为可以大大提升团队士气,而有一个方便同事分享鼓励的地方有助于让这种情况成为一种常见现象。
Most importantly, #feat-vitess-channels was a place for us to keep each other motivated. As the refactor dragged on, with engineers cycling on and off and Dark mode continuing to throw us a number of curveballs, it became increasingly difficult to stay optimistic about our progress. Engineers from across the company would occasionally pop in with an encouraging “You got this!” or a series of emoji reactions to a weekly status update. Small, thoughtful acts of support can go a long way to boost team morale, and having a convenient place where colleagues could share their encouragement helped make it a common occurrence.
通过将与项目有关的所有沟通集中在一个地方,参与重构的每个人都可以轻松地保持一致。团队成员可以加入或离开工作,而无需进行广泛的知识传递。外部利益相关者可以查看最新进展,而无需直接联系您。也许最重要的是,它可以成为一个支持和鼓励的地方。有关如何建立良好沟通习惯的想法,请参阅第 7 章。
By keeping all communication pertaining to the project in a single place, it’s easy for everyone involved with the refactor to stay on the same page. Teammates can join and leave the effort without extensive knowledge transfers. External stakeholders can check in on the latest progress without pinging you directly. Perhaps most importantly, it can be a place of support and encouragement. For ideas on how to establish good communication habits, refer to Chapter 7.
将频道会员表迁移到 Vitess 有一个明确的推出策略,分为四个具体阶段。在每个阶段,我们都对何时应该选择不同的用户群体进行更改有着清晰的愿景(即首先是公司用户,然后是免费套餐客户,常规付费客户,最后是我们最大的客户)。在此过程之上,我们使用了专为 Vitess 迁移用例构建的高度可靠的工具,这使我们能够按照我们喜欢的速度快速将每种不同的模式提升(和降低)到不同的用户群体。
The migration of the channel membership table to Vitess had a well-defined rollout strategy split into four concrete phases. At each stage, we had a strong vision of when we should opt different groups of users into our changes (i.e., users at the company first, followed by customers on the free tier, regular paid customers, and our largest customers last). On top of this procedure, we used highly reliable tooling built explicitly for the Vitess migration use case, which enabled us to quickly ramp up (and down) each of the different modes to distinct slices of users at our preferred pace.
这些因素中的每一个都帮助我们快速前进,但也许最有效的一点是,如果我们开始注意到对用户造成不利影响,我们能够立即回滚。拥有这种能力意味着我们不怕积极前进。当我们进入轻量模式阶段时,这一点特别有用,因为我们使用公司内部的志愿者从 Vitess 集群读取数据。
Each of these factors helped us move forward quickly, but perhaps the most effective piece was our ability to roll back immediately if we began to notice a detrimental impact to our users. Having that power at our fingertips meant that we weren’t afraid to move forward aggressively. It was particularly useful when we entered the Light mode phase as we used volunteers within the company to read data from the Vitess cluster.
Even the most thoughtfully planned, meticulously executed refactor will lead to a handful of bugs, and it is often impossible to identify them all before beginning a rollout. If you can control who is opted in to your changes at important milestones, and can roll back swiftly, you’ll be able to make progress much more nimbly, surfacing potentially terrible regressions well before they become serious incidents.
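The combination described above (opting in cohorts in a fixed order, ramping each mode up or down, and rolling back immediately) can be illustrated with a small sketch. This is not the tooling we used; the `ModeRollout` class, its method names, the cohort labels, and the hash-based bucketing are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field

# Cohort order taken from the text: internal users first, largest customers last.
# The labels themselves are hypothetical.
COHORT_ORDER = ["internal", "free", "paid", "largest"]


def _bucket(mode_name: str, user_id: int) -> int:
    """Map a user to a stable bucket in [0, 100) for a given mode."""
    digest = hashlib.sha256(f"{mode_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


@dataclass
class ModeRollout:
    """Ramp state for one migration mode (for example, "dark" or "light")."""

    name: str
    # Percentage (0-100) of each cohort currently opted in to this mode.
    percent_by_cohort: dict = field(
        default_factory=lambda: {cohort: 0 for cohort in COHORT_ORDER}
    )

    def ramp(self, cohort: str, percent: int) -> None:
        """Set the opt-in percentage for one cohort (up or down)."""
        if cohort not in self.percent_by_cohort:
            raise ValueError(f"unknown cohort: {cohort}")
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.percent_by_cohort[cohort] = percent

    def roll_back(self) -> None:
        """Opt every cohort out of this mode at once."""
        for cohort in self.percent_by_cohort:
            self.percent_by_cohort[cohort] = 0

    def is_enabled(self, user_id: int, cohort: str) -> bool:
        """Return True if this user is currently opted in to the mode."""
        percent = self.percent_by_cohort.get(cohort, 0)
        return _bucket(self.name, user_id) < percent


# Usage: ramp Light mode to all internal users, then roll back.
light = ModeRollout("light")
light.ramp("internal", 100)
print(light.is_enabled(42, "internal"))  # True: the whole cohort is opted in
light.roll_back()
print(light.is_enabled(42, "internal"))  # False: rollback opts everyone out
```

Because each user hashes to a fixed bucket, raising a cohort's percentage only adds users and never reshuffles the ones already opted in, which keeps a ramp-up predictable. Rolling back is a single state change, not a deploy, which is the property that made moving aggressively feel safe.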
Here are the most important takeaways from our refactor to migrate channels_members from our workspace-sharded clusters to user- and channel-sharded clusters in Vitess.
Set realistic estimates. Optimism is great, but missed deadlines can have serious ramifications.
Source the teammates you need; the ones available to you or currently on your team may not be the ones best suited for the job. Don’t be afraid to ask for new (or more) resources if you need them.
Plan project scope carefully. Any added scope should be accounted for during the planning phase to set expectations appropriately.
Choose a single place for project communication and stick to it.
Design a thoughtful rollout plan and invest in building the tooling you need to make ramping up (and down) as easy as possible.
The animals on the cover of Refactoring at Scale are walruses (Odobenus rosmarus), large marine mammals found in the Arctic and subarctic regions surrounding the North Pole.
Walruses are well known for their long, sharp tusks that aid them in breaking ice, climbing out of the water, establishing dominance in a herd, and defending themselves from predators. Short fur sparsely covers the walrus’s thick skin, which ranges in color from gray to a yellow-brown. A much thicker layer of blubber provides warmth and stored energy, allowing them to survive in harsh conditions.
These slow-moving carnivores prefer to live in areas of ice and shallow water to allow easy access to food and will migrate seasonally to find ice of optimal thickness. Short front flippers and larger hind flippers propel this one-ton (on average) creature through the water, while its whiskers, more so than its eyes, are used for navigation and food identification. Walruses mostly consume large amounts of mollusks and other shellfish, but have been known to occasionally eat larger animals such as seabirds and even seals.
Global climate change and human predation have caused the walrus’s conservation status to be listed as Vulnerable. Many of the animals on O’Reilly covers are endangered; all of them are important to the world.
The cover illustration is by Karen Montgomery, based on a black and white engraving from Natural History of Animals by Vogt & Specht. The cover fonts are Gilroy Semibold and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.